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Preface 



On the 23rd of April, 2001, the 6th Workshop on High-Level Parallel Pro- 
gramming Models and Supportive Environments (LCTES’98) was held in San 
Francisco. HIPS has been held over the past six years in conjunction with IPDPS, 
the Internation Parallel and Distributed Processing Symposium. 

The HIPS workshop focuses on high-level programming of networks of work- 
stations, computing clusters and of massively-parallel machines. Its goal is to 
bring together researchers working in the areas of applications, language design, 
compilers, system architecture and programming tools to discuss new develop- 
ments in programming such systems. 

In recent years, several standards have emerged with an increasing demand 
of support for parallel and distributed processing. On one end, message-passing 
frameworks, such as PVM, MPI and VIA, provide support for basic communi- 
cation. On the other hand, distributed object standards, such as CORBA and 
DCOM, provide support for handling remote objects in a client-server fashion 
but also ensure certain guarantees for the quality of services. 

The key issues for the success of programming parallel and distributed envi- 
ronments are high-level programming concepts and efficiency. In addition, other 
quality categories have to be taken into account, such as scalability, security, 
bandwidth guarantees and fault tolerance, just to name a few. 

Today’s challenge is to provide high-level programming concepts without sac- 
rificing efficiency. This is only possible by carefully designing for those concepts 
and by providing supportive programming environments that facilitate program 
development and tuning. 

Past results in parallel computing on one side and distributed systems on the 
other side present opportunities for an increased transfer of knowledge between 
the areas. In particular, cluster computing presents a promising framework for 
parallel computing where advances from distributed systems can be utilized. 
Achievements in the area of automated performance analysis and performance 
modeling for parallel systems, on the other hand, may contributed to advances 
in performance analysis of distributed systems, as well. 

Future directions also include alternatives to current standardization prac- 
tices, for example, by replacing client-server protocols with decentralized ones 
that may be more suitable for distributed systems. On the other hand, suc- 
cessful programming models, such as the shared-memory paradigm, should be 
investigated for new trends like cluster computing. 

This workshop provides a forum for researchers and commercial developers 
to meet and discuss the various hardware and software issues involved in the 
design and use of high-level programming models and supportive environments. 
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HIPS’Ol featured an invited talk and presentations of ten refereed papers. 
The papers were selected out of 20 submissions. 21 referees prepared multiple 
reviews for each submission. The recommendations of the referees determined an 
initial set of papers selected for acceptance. The members of the program com- 
mittee were then given the option to resolve differences in opinion for the papers 
they refereed based on the written reviews. After the resolution process, the 10 
papers included in the workshop proceedings were selected for presentation and 
publication. The papers cover the topics of: 

— Concepts and languages for high-level parallel programming. 

— Concurrent object-oriented programming. 

— Distributed objects and components. 

— Structured parallel programming (skeletons, patterns, etc.). 

— Software engineering principles for parallel system. 

~ Automatic parallelization and optimization. 

— High-level programming environments. 

— Automated performance analysis and performance modeling. 

— Debugging techniques and development tools. 

— Distributed shared memory. 

— Implementation techniques for high-level programming models. 

— Operating system support for runtime systems and middleware. 

~ Architectural support for high-level programming models. 

— Guarantees for Quality of Service in distributed environments. 

— Security of communication for distributed execution. 

— Fault tolerance in network computing. 
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Abstract. Clusters of shared-memory multiprocessors (SMPs) have be- 
come the most promising parallel computing platforms for scientific com- 
puting. However, SMP clusters significantly increase the complexity of 
user application development when using the low-level application pro- 
gramming interfaces MPI and OpenMP, forcing users to deal with both 
distributed-memory and shared-memory parallelization details. In this 
paper we present extensions of High Performance Fortran for SMP clus- 
ters which enable the compiler to adopt a hybrid parallelization strat- 
egy, efficiently combining distributed-memory with shared-memory par- 
allelism. By means of a small set of new language features, the hier- 
archical structure of SMP clusters may be specified. This information 
is utilized by the compiler to derive inter-node data mappings for con- 
trolling distributed-memory parallelization across the nodes of a cluster, 
and intra-node data mappings for extracting shared-memory parallelism 
within nodes. Additional mechanisms are proposed for specifying inter- 
and intra-node data mappings explicitly, for controlling specific SM par- 
allelization issues, and for integrating OpenMP routines in HPF appli- 
cations. The proposed features are being realized within the ADAPTOR 
and VFC compiler. The parallelization strategy for clusters of SMPs 
adopted by these compilers is discussed as well as a hybrid-parallel ex- 
ecution model based on a combination of MPI and OpenMP. Early ex- 
perimental results indicate the effectiveness of the proposed features. 
Keywords: Parallel programming, HPF, OpenMP, MPI, SMP clusters, 
parallelization, hybrid parallelism. 



1 Introduction 

Many supercomputer centers have introduced clusters of shared-memory mul- 
tiprocessors (SMPs) as their premier high performance computing platforms. 
Examples of such systems are multiprocessor clusters from SUN, SGI, IBM, a 
variety of multi-processor PC clusters, supercomputers like the NEC SX-5 or the 
future Japanese Earth Simulator and the ASCI White machine. 

* This work was supported by NEC Europe Ltd. as part of the ADVICE and AD- 
VANCE projects in cooperation with the NEC C&C Research Laboratories and by 
the Special Research Program SEB FOll “AURORA” of the Austrian Science Fund. 

F. Mueller (Ed.): HIPS 2001, LNCS 2026, pp. l-Q 2001. 

(c) Springer- Verlag Berlin Heidelberg 2001 
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Clusters of SMP systems increase the complexity of user applications devel- 
opment by forcing programmers to deal with shared-memory programming issues 
such as multi-threading and synchronization, as well as with distributed-memory 
issues such as data distribution and explicit message-passing. 

There are mainly two trends in parallel programming, depending on how 
the address space of parallel systems is organized. On the one hand, the stan- 
dard application programming interface (API) for message-passing, MPI is 
widely used, mainly for distributed-memory systems. Recently, a standard API 
for shared-memory parallel programming, OpenMP o, has become available. 
While OpenMP is restricted to shared-memory architectures only, MPI programs 
can also be executed on shared-memory machines and clusters. 

However, MPI programs which are executed on clusters of SMPs usually 
do not directly utilize the shared-memory available within nodes and thus may 
miss a number of optimization opportunities. A promising approach for parallel 
programming attempts to combine MPI and OpenMP in a single application. 
Such a strategy attempts to fully exploit the potential of SMP clusters by relying 
on data distribution and explicit message-passing between the nodes of a cluster, 
and on shared-memory and multi-threading within the nodes. While such an 
approach allows optimizing parallel programs by taking the hybrid architecture 
of SMP clusters into account, applications written in such a way tend to become 
extremely complex. 

In contrast to MPI and OpenMP, High Performance Fortran (HPF) is a high- 
level parallel programming language which can be employed on both distributed- 
memory and shared-memory machines. The data mapping provides the necessary 
data locality to minimize communication and synchronization that is generated 
automatically by the compiler. Although HPF programs can be compiled for 
clusters of SMPs, the language does not provide features for exploiting the hier- 
archical structure of clusters. As a consequence, current HPF compilers usually 
ignore the shared-memory aspect of SMP clusters and treat such machines as 
pure distributed-memory systems. 

In order to overcome these shortcomings of HPF and its compilers, we pro- 
pose extensions of the HPF mapping mechanisms such that the hierarchical 
structure of SMP clusters can be taken into account. Based on these features, 
an HPF compiler can then adopt a hybrid parallelization strategy whereby 
distributed-memory parallelism based on message-passing, e.g. MPI, is exploited 
across the nodes of a cluster, while shared-memory parallelism is exploited within 
SMP nodes by relying on multi-threading, e.g. OpenMP. 

Two sets of HPF extensions for clusters of SMPs are proposed in this pa- 
per. The first set of extensions is based on the concept of processor mappings, 
a simple mechanism for specifying the hierarchical structure of SMP clusters by 
introducing abstract node arrays onto which abstract processor arrays may be 
mapped by means of the basic HPF distribution mechanisms. The second set of 
extensions allows the specification of hierarchical data mappings in two separate 
steps, comprising an inter-node data mapping which maps data arrays onto ab- 
stract node arrays and an intra-node data mapping which maps node-local data 
onto processors within a node. Additional language mechanisms are provided for 
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the integration of OpenMP routines or other node routines that operate only on 
data local to a node. 

The rest of this paper is organized as follows. Section Q proposes language 
features for optimizing existing HPF programs for clusters of SMPs by provid- 
ing an explicit specification of the hierarchical structure of clusters. Section Q 
sketches a hierarchical compilation and execution model for clusters of SMPs 
which is adopted for the parallelization of HPF programs that utilize the pro- 
posed language extensions. Section 0 describes a set of additional HPF exten- 
sions for specifying inter-node and intra-node data mappings explicitly, and a 
new EXTRINSIC model for integrating OpenMP routines into HPF applications. 
Related work is discussed in Section 0 Finally, conclusions and a discussion of 
future work are presented in Section 0 

2 Exploiting the Hierarchical Structure of SMP Clusters 

HPF provides the concept of abstract processor arrangements to establish an ab- 
straction of the underlying parallel target architecture in the form of one or more 
rectilinear processor arrays. Processor arrays are utilized within data distribu- 
tion directives to describe a mapping of array elements to abstract processors. 
Array elements mapped to an abstract processor are owned by that processor. 
Ownership of data is the central concept for the execution of data parallel pro- 
grams. Based on the ownership of data, the distribution of computations to 
abstract processors and the necessary communication and synchronization are 
derived automatically. 

Consider now an SMP cluster consisting of NN nodes, each equipped with 
NPN processors. Currently, if an HPF program is targeted to an SMP cluster, 
abstract processors are either associated with the NN nodes of the cluster or 
with the NN * NPN processors. In the first case, data arrays are distributed 
only across the NN nodes of the cluster and therefore only parallelism of degree 
NN can be exploited. In the second case, where abstract HPF processors are 
associated with all processors available in the cluster, potential parallelism of 
degree NN * NPN can be exploited. However, by viewing an SMP cluster as a 
distributed-memory machine consisting of NN * NPN processors, the shared- 
memory available within nodes is usually not exploited, since data distribution 
and communication are performed within nodes as well. 

In the following we propose language features for optimizing existing HPF 
programs for clusters of SMPs by providing an explicit specification of the hier- 
archical structure of clusters. 

2.1 Processor Mappings 

The concept of proeessor mappings is introduced in order to describe the hier- 
archical structure of SMP clusters. A processor mapping specifies a mapping of 
an abstract processor array to an abstract node array. The NODES directive is 
introduced for declaring one or more abstract node arrays. For the specification 
of processor mappings a subset of the HPF data distribution mechanisms as 
provided by the DISTRIBUTE directive are utilized. 
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Fig. 1. Processor mappings. The hierarchical structure of SMP clusters is de- 
scribed by mapping processor arrays to node arrays using a subset of HPF’s 
distribution mechanisms. 



As a consequence of a processor mapping, each processor of an abstract 
processor array is mapped to a node of an abstract node array. Processor map- 
pings are specified by using a subset of the HPF distribution mechanisms. For 
homogeneous clusters, the usual HPF BLOCK distribution format may be used. 
Heterogeneous clusters can be supported by means of the GEN_BLQCK distribution 
format of the Approved Extensions of HPF . 



OMSt^HS) 

!hpf$ processors P(8) 

!hpf$ nodes N(4) 

!hpf$ distribute P (block) onto N 




!hpf$ processors R(4,8) 

!hpf$ nodes M(4) 

!hpf$ distribute R(*, block) onto M 



real A(NA) 

!hpf$ distribute A (block) onto P 



real B(NB1,NB2) 

!hpf$ distribute B(cyclic, block) onto R 



(a) 



(b) 



Fig. 2. Examples of processor mappings. On the left-hand side Figure (a) shows 
a 4x2 SMP cluster, while on the right-hand side Figure (b) illustrates a cluster 
of 4 nodes each with 8 processors, arranged in a 4x2 configuration. 



Figure n illustrates the concept of processor mappings while examples of 
processor mappings are shown in Figures O and Q Figure O shows how the 
hierarchical structure of an SMP cluster, consisting of 4 nodes with 2 processors 
each, may be specified. Figure Q(b) specifies an SMP cluster, consisting of 4 
nodes with 8 processors each, where the processors within a node are arranged 
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as a 4x2 array. FigureQ specifies a heterogeneous SMP cluster, consisting of 4 
nodes with 2, 3, 4, and 3 processors, respectively. Here the GEN_BL0CK distribution 
format of HPF is utilized to indicate that the number of processors within nodes 
varies. 




integer, dimension (4) : : SIZE = (/2,3,4,3/) 
!hpf$ processors R(12) 

!hpf$ nodes M(4) 

!hpf$ distribute R(gen_block(SIZE) ) onto M 



Fig. 3. A heterogeneous SMP cluster with four nodes. The GEN_BL0CK distribu- 
tion format is used to specify the number of processors available in each node. 



In order to support abstract node arrays whose sizes are determined upon 
start of a program, the new intrinsic function NUMBER_DF_NODES is provided. 
NUMBER_0F_N0DES returns the actual number of nodes used to execute a program 
in the same way as the HPF intrinsic function NUMBER_OF_PROCESSDRS returns 
the total number of processors used to execute an HPF program. 

Processor mappings provide a simple means for optimizing HPF applica- 
tions for SMP clusters. Using processor mappings, the hierarchical structure of 
SMP clusters may be specified explicitly, without the need to change existing 
HPF directives. Based on a processor mapping, an HPF compiler can adopt a 
cluster-specific parallelization strategy in order to exploit distributed-memory 
parallelism across the nodes of a cluster, and shared-memory parallelism within 
nodes. 

2.2 Exploiting DM and SM Parallelism 

If a dimension of an abstract processor array is distributed by the format BLOCK 
or GEN_BL0CK, contiguous blocks of processors are mapped to the nodes in the 
corresponding dimension of the specified abstract node array. As a consequence 
of such a processor mapping, both distributed-memory parallelism and shared- 
memory parallelism may be exploited for all data array dimensions that are 
mapped to that processor array dimension. On the other hand, if in a processor 
mapping a dimension of an abstract processor array is distributed by means of 
all abstract processors in that dimension are mapped to the same node 
of an abstract node array, and thus only shared-memory parallelism may be 
exploited across array dimensions which have been mapped to that processor 
array dimension. 

In the first example (Figure 0 (a)), both distributed- and shared-memory 
parallelism may be exploited for array A. In the other example (Figure 0 (b)), 
only shared-memory parallelism may be exploited across the first dimension of 
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array B, while both shared-memory and distributed-memory parallelism may be 
exploited across the second dimension of B. 

2.3 Deriving Inter- and Intra-Node Mappings 

An HPF data mapping directive, e.g. DISTRIBUTE A (BLOCK) ONTO P, deter- 
mines for each processor P(I) of the abstract processor array P those parts 
of the data array A that are owned by P (I) . If, in addition, a processor mapping 
for P with respect to a node array N is specified, an inter-node data mapping may 
be automatically derived from the data mapping and the processor mapping. 

An inter-node data mapping determines for each node N(J) those parts of 
array A that are owned by node N(J). The implicit assumption is that those 
portions of an array owned by a node are allocated in an unpartitioned way 
in the shared memory of this node, while data is distributed across the local 
memory of nodes, according to inter-node mapping. 

In the same way, an intra-node data mapping may be derived from a data 
mapping and a processor mapping. Intra-node data mappings specify a mapping 
of the data allocated on a node of a cluster with respect to the processors within 
a node. Here, we assume that intra-node data mappings are utilized by the 
compiler to control the exploitation of shared-memory parallelism within SMP 
nodes. This is described in more detail in the next section. 

3 Compilation and Execution Model 

In this section we briefly sketch a hierarchical compilation and execution model 
for clusters of SMPs which is adopted for the parallelization of HPF programs 
that utilize the proposed language extensions. This model is currently being real- 
ized within the ADAPTOR HPF compilation system Q and within the VFC Q 
compiler. 

3.1 Outline of the Compilation Strategy 

Both the ADAPTOR and the VFC compiler are source-to-source translation 
systems which transform an HPF program into an explicitly parallel Fortran 
program which is then compiled by the Fortran compiler of the parallel tar- 
get machine in order to obtain an executable parallel program. As opposed to 
the usual HPF compilation, where a single-threaded SPMD node program is 
generated, a multi-threaded node program is generated under the hierarchical 
execution model. 

An extended HPF program is compiled in two phases for clusters of SMPs. In 
the first compilation phase the compiler analyzes the data mappings and proces- 
sor mappings in order to determine an inter-node data mapping. Based on the 
inter-node data mapping, the compiler distributes data and work across nodes 
and inserts message-passing primitives in order to realize the communication of 
non-local data between the nodes of a cluster. The result of the first compilation 
phase is an SPMD Fortran/MPI message-passing program. 
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In Figure 0(b) the first compilation phase is illustrated for a simple HPF 
code fragment showing a reduction operation on a two-dimensional array.Due to 
the specified processor mapping, the first dimension of B is classified as shared 
while the second dimension is classified as distributed. As a consequence, the 
second dimension of B is distributed across the 4 nodes, while the first dimension 
is not distributed. The SUM intrinsic function is transformed into a nested loop. 
The outer loop is strip-mined across the nodes of the cluster. 

The call to HPFrt_DM_get_my .bounds ensures that each node operates only 
on its local part of B. After the loop, communication between nodes is performed 
by means of mpi.allreduce in order to combine the 

partial results of all nodes. 

In the second phase the intermediate SPMD program is parallelized for 
shared-memory according to the intra-node data mapping derived by the com- 
piler. Shared- memory parallelization is currently realized by inserting corre- 
sponding OpenMP directives in order to distribute the work of a node among 
multiple threads. Work distribution of loops and array assignments is derived 
from the intra-node data mapping of the accessed arrays and realized by cor- 
responding OpenMP work-sharing constructs and/or appropriate loop transfor- 
mations (e.g. strip-mining). Similarly, data consistency of shared data objects 
is enforced by inserting appropriate OpenMP synchronization primitives. As 
shown in Figure 0(c) during the second compilation phase the SPMD message- 
passing program is transformed by inserting OpenMP directives in order to ex- 
ploit shared-memory parallelism within the nodes of the cluster. The OpenMP 
parallel directive ensures that multiple threads are generated on each node, 
and the call to HPFrt_SM_get_my .bounds enforces that each thread executes its 
own chunk of inner loop iterations. 



3.2 The Hierarchical Execution Model 

HPF programs compiled for clusters of SMPs as outlined above, are executed 
according to a hierarchical execution model. Within the hierarchical execution 
model an HPF program is executed by a set of parallel processes, each of which 
executes on a separate node of an SMP cluster within its own local address space. 
Usually, every abstract HPF node is associated with a separate MPI-process 
executed on a single node of the cluster. Process parallelism, data partitioning, 
and message-passing communication based on MPI is utilized across nodes. 

Each node process generates a set of threads which emulate the abstract 
processors mapped to a node and which execute concurrently in the shared 
address space of a node. The data mapped to one node is allocated in a non- 
partitioned way in the shared memory, regardless of the intra-node mapping. 
Parallel execution of threads within nodes is organized on the basis of the derived 
intra-node data mapping which controls the distribution of computations among 
the threads. Consistency of shared data objects is guaranteed by automatically 
generated synchronization primitives 0 . 
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!hpf$ processors P(6,4) 

!hpf$ nodes N(4) 

!hpf$ distribute P(*, block) onto N 
real B(NB1,NB2), S 

!hpf$ distribute B(block, block) onto P 
S = sum(B) 



(a) Original HPF program. 



real B (NBl ,NB2/4) , S ! for simplicity: NB2 is a multiple of 4 
integer B_DSP (...) ! internal array descriptor for B 

integer B_LB2, B_UB2 

... ! build array descriptor 

S = 0.0 

call HPFrt_DM_get_my_bounds (B_DSP, 2, B_LB2 , B_UB2) 
do I = B_LB2, B_UB2 
do J = 1, NBl 

S = S + B(J,I) 
end do 
end do 

call mpi_allreduce (S, S, 1, MPI_REAL, MPI_SUM, MPI_CQMM_W0RLD , lERR) 

(b) Phase 1: Intermediate SPMD Fortran/MPI message-passing node program. 



integer B_LB1, B_UB1, B_LB2, B_UB2 

... ! build array descriptor 

S = 0.0 

call HPFrt_DM_get_my .bounds (B_DSP, 2, B_LB2, B_UB2) 

! $omp parallel , private (I , J ,B_LB1 ,B_UB1) , reduction(S) 

call HPFrt_SM_get_my .bounds (B.DSP, 1, B.LBl, B.UBl) 
do I = B.LB2, B.UB2 
do J = B.LBl, B.UBl 
S = S + B(J,I) 
end do 
end do 

! $omp end parallel 

call mpi.allreduce (S, S, 1, MPI.REAL, MPI.SUM, MPI.COMM.WORLD , lERR) 

(c) Phase 2: Generated hybrid-parallel MPI/OpenMP program. 



Fig. 4. Compilation of a HPF program for clusters of SMPs. 
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Fig. 5. Hierarchical HPF execution model. 



3.3 Experiments 

In this section we report on early experiments with the proposed language exten- 
sions and the hierarchical compilation and execution model on a dual-processor 
PC cluster. For the performance experiments, we used an existing HPF kernel 
from a numerical pricing module developed in the context of the AURORA 
Financial Management System Q. This kernel realizes a backward induction 
algorithm on a Hull and White interest rate tree. In the HPF code the inter- 
est rate tree is represented by 2-dimensional arrays and several index vectors 
which capture the structure of the tree. All these arrays have been distributed 
by usual HPF distribution mechanisms. The main computations are performed 
in an inner INDEPENDENT loop with indirect data accesses, which operates on a 
single level of the Hull and White tree. Due to the peculiarities of the algorithm, 
communication is required for each level of the tree, introducing a significant 
communication overhead. 

For the performance experiments, we computed the price of a variable coupon 
bond with a maturity of 10 years on a one-day basis, resulting in a tree with 3650 
levels. We compared the original HPF kernel to an extended kernel, where an 
additional processor mapping was used for specifying the hierarchical structure 
of the PC cluster. Both kernels have been parallelized with the VFC compiler Q 
and executed on a Beowulf cluster consisting of 8 nodes, each equipped with 
two Pentium II (400MHz) processors, connected by Fast Ethernet. The pgf90 
compiler from Portland Group Inc. was used as a back-end compiler of VFC for 
compiling the generated MPI/OpenMP program. 
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The original HPF kernel was compiled by VFC to Fortran/MPI, while the 
extended kernel was compiled to Fortran/MPI/OpenMP and executed according 
to the hierarchical execution model utilizing MPI process parallelism across the 
nodes of the cluster and OpenMP thread parallelism within nodes. 

The original HPF kernel has been measured in two different ways which are 
labeled in Figurefl HPF/MPI-P and HPF/MPI-N, respectively. HPF/MPI-P refers 
to the MPI code generated by VFC and executed on 2 to 16 processors, where 
both processors were utilized on each node. HPF/MPI-N refers to the MPI version 
of the generated code executed on 2 to 8 nodes, where only one processor was 
utilized on each node. HPF-Cluster/MPI-OpenMP refers to the hybrid parallel 
code generated by VFC from the extended HPF kernel. 

As Figure Q shows, the HPF-Cluster kernel significantly outperforms the 
original HPF kernel, if both processors in a node are used. One reason for this 
performance difference seems to be the communication overhead induced by 
MPI communicatiorfl within the nodes of the cluster. The times measured for 
the HPF-Cluster/MPI-OpenMP and HPF/MPI-N variants were almost the same, 
however, the HPF/MPI-N version required twice as many nodes. 




Fig. 6. Experimental Results. 



^ The MPI version used on the PC cluster did not offer an optimized intra-node 
communication via shmem. 
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4 Additional Language Features 

As outlined in the previous section, the introduction of abstract node arrays and 
processor mappings allows the specification of the hierarchical structure of SMP 
clusters. Based on these features, an HPF compiler may derive from existing 
HPF distribution an inter-node mapping which is utilized to distribute data 
across the nodes of a cluster, and an intra-node mapping in order to exploit 
shared-memory parallelism within nodes. 

In this Section we describe a set of additional HPF extensions which to enable 
programmers to specify inter-node data mappings and intra-node data mappings 
explicitly. Such mappings are referred to as hierarchical data mappings and are 
defined in two steps. First, arrays are distributed to the nodes of a cluster, and 
second, node-local data is mapped to the processors within a cluster. 



4.1 Hierarchical Data Mappings 

A hierarchical data mapping may be specified explicitly by the user by first 
mapping data arrays onto abstract node arrays, and then specifying a mapping 
of node-local data with respect to the processors within a node. The first mapping 
is referred to as an explicit inter-node data mapping, and the second mapping is 
called explicit intra-node data mapping. 

Inter-node data mappings may be specified explicitly by means of the HPF 
DISTRIBUTE directive which has been extended such that data arrays may be 
distributed directly onto abstract node arrays. The semantics of a distribution 
directive with an abstract nodes array as distribution target is equivalent to 
the semantics of a usual HPF distribution directive, except that data is now 
distributed only with respect to the nodes of a cluster. Processors within a node 
of a cluster usually have access to all data mapped to the shared memory of a 
node. 

In addition to an inter-node mapping, an intra-node mapping may be ex- 
plicitly specified by means of the SHARE directive. An intra-node data mapping 
specifies a mapping of node-local data to the abstract processors (threads) within 
a node. The syntax of the SHARE directive is similar to the DISTRIBUTE direc- 
tive, but does not include an ONTO clause. The target of the mapping is the set 
of abstract processors available within a node. 

The SHARE directive is used to control the distribution of work among the 
threads available on a node. In this context, the OpenMP work-sharing formats 
DYNAMIC and GUIDED may be employed in addition to the usual HPF distribution 
formats BLOCK, CYCLIC, and GEN_BL0CK. 



4.2 Controlling the Work Sharing of Loops 

An inter-node mapping explicitly specified by a SHARE directive for certain arrays 
is utilized in order to determine the work sharing of parallel loops or array 
assignments where such arrays are accessed. The user may override such an 
automatic strategy by appending a SHARE clause to an INDEPENDENT directive. 
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!hpf$ nodes N(8) 

real A(1000,1000) 

!hpf$ distribute A(block,*) onto N ! inter-node mapping (data distribution) 
!hpf$ share A(*,block) ! intra-node mapping (work sharing) 

!hpf$ independent, share (dynamic) 
do J = 1, N 

A(: ,J) = f (A(: ,J)) 
end do 



Fig. 7. Hierarchical data mappings. The first dimension of A is distributed by 
block onto the nodes of an SMP cluster, while for the second dimension of A a 
block work sharing strategy is specified by means of the share directive. Note 
that the work sharing information provided for an array in the specification part 
of a program may be overwritten at individual loops by attaching a share clause 
to an independent directive. 



An example of a hierarchical mapping is shown in Figure Q Assuming a 
homogeneous cluster of 8 nodes, each of which equipped with NPN processors, 
the inter-node mapping specified by means of the DISTRIBUTE directive dis- 
tributes the first dimension of array A across the 8 nodes of the cluster, mapping 
a sub-array of shape (125,1000) to each node. The SHARE directive specifies a 
virtual distribution within each node. As a consequence, a sub-array of shape 
(125,1000/NPN) is mapped to each processor within a node. The SHARE clause 
attached to the INDEPENDENT directive specifies that a dynamic scheduling strat- 
egy should be adopted for scheduling the loop iterations of the J-loop among 
the threads executing on the processors within nodes. Without the SHARE clause 
a BLOCK-scheduling strategy would be adopted according to the specified intra- 
node mapping. 

4.3 Integration of OpenMP Routines 

The LOCAL extrinsic model of HPF eg provides the possibility to write code 
that is executed on all active processors, where each processor has direct access 
only to its local data. We introduce a new extrinsic model, called N0DE_L0CAL, 
that allows writing code which is executed on all active nodes, where on each 
node only node-local data according to the inter-node mapping is directly ac- 
cessible. In NQDE_LOCAL routines the potential shared-memory parallelism within 
nodes can be exploited independently of the HPF level. 

A N0DE_L0CAL subprogram can be written in one of the following languages: 

— ’HPF’, referring to the HPF language, where the HPF compiler can take 
advantage of the fact that only processors on the same node are involved. 

— ’FORTRAN’ , referring to the ANSI/ISO standard Fortran language, in which 
the user can take advantage of thread parallelism, for example, via OpenMP. 
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extrinsic (M0DEL=N0DE_L0CAL) subroutine Sub (A) 
real, dimension :: A(:,:) 
integer N 

! $0MP parallel 

end subroutine 



Fig. 8. Example of a N0DE_L0CAL EXTRINSIC routine. 



Node-local HPF procedures can use any HPF intrinsic or library procedure. 
Similar to the HPF Local Routine Library, a new HPF Node Local Routine 
Library is provided for the use within HPF node-local routines, and a Fortran 
Node Local Routine Library which realizes the HPF Node Local Routines with 
a Fortran interface. 

Node-local Fortran routines are very similar to local Fortran routines. The 
only difference is that the programmer now sees in the subprogram only the 
data mapped to a node instead of the data mapped to a processor. It is the pro- 
grammers responsibility to exploit parallelism across multiple processors within 
a node, e.g. via OpenMP or via any other thread library. 

5 Related Work 

Several researchers have investigated the advantages of a hybrid programming 
model based on MPI and OpenMP against a unified MPI-only model. Cappelo Q 
et al. investigated a hybrid-parallel programming strategy in comparison with a 
pure message passing approach using the NAS benchmarks on IBM SP systems. 
In their experiments the MPI-only approach is better than a hybrid strategy 
for most codes. They conclude that a hybrid-parallel strategy becomes supe- 
rior when fast processors make the communication performance significant and 
the level of parallelization is sufficient. Henty Q reports on experiments with 
a Discrete Element Modeling code on various SMP clusters. He concludes that 
current OpenMP implementations are not yet efficient enough for hybrid par- 
allelism to outperform pure message-passing. Haan 0 performed experiments 
with a matrix-transpose showing that a hybrid-parallel approach can signifi- 
cantly outperform message-passing parallelization. 

On the 0rigin2000, the SGI data placement directives |[g form a vendor spe- 
cific extension of OpenMP. Some of these extensions have similar functionality as 
the HPF directives, e.g. “affinity scheduling” of parallel loops is the counterpart 
to the ON clause of HPF. Compaq has also added a new set of directives to its 
Fortran for Tru64 UNIX that extend the OpenMP Fortran API to control the 
placement of data in memory and the placement of computations that operate 
on that data Q. 

Chapman, Mehrotra and Zima Q propose a set of OpenMP extensions, sim- 
ilar to HPF mapping directives, for locality control, but they do not provide a 
execution model or implementation scheme. 
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Portland Group, Inc. proposes a high-level programming model [3 that ex- 
tends the OpenMP API with additional data mapping directives, library routines 
and environment variables. This model extends OpenMP in order to control data 
locality with respect to the nodes of SMP clusters. In contrast to this model, 
HPF, with the extensions proposed in this paper, supports locality control across 
nodes as well as within nodes. 

All these other approaches introduce data mapping features into OpenMP in 
order to increase locality, but still utilize the explicit work distribution via the 
PARALLEL and PARDO directives of OpenMP. Our approach is based on HPF and 
relies on an implicit work distribution which is usually derived from the data 
mapping but which may be explicitly controlled by the user within nodes by 
means of OpenMP-like extensions. 

A number of researches have addressed the issues of implementing OpenMP 
on clusters of SMPs relying on a distributed-shared memory (DSM) software 
layer. Hu et al. m describe the implementation of OpenMP on a network of 
shared-memory multiprocessors by means of a translating OpenMP directives 
into calls to a modified version of the TreadMarks software distributed memory 
system. Sato et al. Q3 describe the design of an OpenMP compiler for SMP 
clusters based on a compiler-directed DSM software layer. 



6 Conclusions and Future Work 

The concept of processor mappings enables HPF programmers to provide the 
compiler with an explicit description of the hierarchical structure of SMP clus- 
ters. The corresponding directives for the definition of abstract node arrays and 
the mapping of abstract processor arrays to abstract node arrays can also be con- 
sidered as implementation-dependent directives that go conform with the current 
HPF language definition and therefore enable an easy way to port existing HPF 
programs to SMP clusters. 

Additional extensions are provided for the explicit specification of inter-node 
and intra-node data mappings. These features give users more control over ex- 
ploiting shared-memory parallelism within a node, by using the SHARE directive 
and SHARE clause, or the new N0DE_L0CAL extrinsic model. 

The directives, in either case, do not add significant complexity to the source 
code. Default processor arrays and node arrays in missing ONTO clauses can be 
chosen in such a way that many HPF programs do not need any modification 
at all. But the new extensions will be very helpful when the shared-memory 
parallelism is more advantageous than the distributed-memory parallelism for 
certain dimensions of the involved arrays. 

A complete specification of the syntax and semantics of the proposed lan- 
guage extensions is subject of ongoing work. Major issues are the provision of 
data mapping mechanisms with respect to subsets of nodes as well as an in- 
tegration of the new features with the HPF and/or OpenMP tasking model. 
Moreover, the evaluation of the new execution model in comparison with a pure 
message-passing model and the investigation of its relevance for a larger range 
of scientific applications will be addressed in future work. 



High-Level Data Mapping for Clusters of SMPs 



15 



References 

1. S. Benkner. VFC: The Vienna Fortran Compiler. Scientific Programming, 7(1):67- 
81, 1999. 

2. S. Benkner and T. Brandes. Exploiting Data Locality on Scalable Shared Memory 
Machines with Data Parallel Programs. In Euro-Par 2000 Parallel Processing, 
Munich, pages 647-657. Lecture Notes in Computer Science (1900), September 
2000 . 

3. J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. Nelson, and C. Offner. 
Extending OpenMP for NUMA Machines. In Proceedings of SC 2000: High Per- 
formance Networking and Computing Conference, Dallas, November 2000. 

4. T. Brandes and F. Zimmermann. ADAPTOR - A Transformation Tool for HPF 
Programs. In K.M. Decker and R.M. Rehmann, editors. Programming Environ- 
ments for Massively Parallel Distributed Systems, pages 91-96. Birkhauser Verlag, 
April 1994. 

5. F. Cappello and D. Etieble. MPI versus MPI-bOpenMP on the IBM SP for the 
NAS Benchmarks. In Proceedings of SC 2000: High Performance Networking and 
Computing Conference, Dallas, November 2000. 

6. B. Chapman, P. Mehrotra, and H. Zima. Enhancing OpenMP with Features for 
Locality Control. In Proc. ECWMF Workshop "Towards Teracomputing - The Use 
of Parallel Processors in Meteorology" , 1998. 

7. E. Dockner, H. Moritsch, G. Ch. Pflug, and A. Swietanowski. AURORA financial 
management system: From Model Design to Implementation. Technical report 
AURORA TR1998-08, University of Vienna, June 1998. 

8. O. Haan. Matrix Transpose with Hybrid OpenMP / MPI Paral- 
lelization. Technical Report Presentation given at SCICOMP 2000, 

tittp : //WWW. spscicomp. org/2000/userpres .html\#haar 2000. 

9. D. S. Henty. Performance of Hybrid Message-Passing and Shared-Memory Par- 
allelism for Discrete Element Modeling. In Proceedings of SC 2000: High Perfor- 
mance Networking and Computing Conference, Dallas, November 2000. 

10. High Performance Fortran Forum. High Performance Fortran Language Specifi- 
cation. Version 2.0, Department of Computer Science, Rice University, January 
1997. 

11. Y. Hu, H. Lu, A. Cox, and W. Zwaenepel. Openmp for networks of smps. In 
Proceedings of IPPS., 1999. 

12. M. Leair, J. Merlin, S. Nakamoto, V. Schuster, and M. Wolfe. Distributed OMP 
- A Programming Model for SMP Clusters. In Eighth International Workshop on 
Compilers for Parallel Computers, pages 229-238, Aussois, France, January 2000. 

13. Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. 
Vers. 1.1, June 1995. MPI-2: Extensions to the Message-Passing Interface, 1997. 

14. H. Moritsch and S. Benkner. High Performance Numerical Pricing Methods. In 
f-th Inti. HPF Users Croup Meeting, Tokyo, October 2000. 

15. M. Sato, S. Satoh, K. Kusano, and Y. Tanaka. Design of openmp compiler for an 
smp cluster. In Proceedings EWOMP ’99, pp. 32-39., 1999. 

16. Silicon Graphics Inc. MIPSpro Power Fortran 77 Programmer’s Guide: OpenMP 
Multriprocessing Directives. Technical Report Document 007-2361-007, 1999. 

17. The OpenMP Forum. OpenMP Fortran Application Program Interface. Version 
1.1, November 1999. ittp://www. openmp. orf 



Integrating Task and Data Parallelism by Means 
of Coordination Patterns* 



Manuel Diaz, Bartolome Rubio, Enrique Soler, and Jose M. Troya 



Dpto. Lenguajes y Ciencias de la Computacion, Malaga University 
29071 Malaga, Spain 
{mdr , tolo, esc , troya}@lcc .uma. es 
tittp : / /WWW . Icc .uma. eE 



Abstract. This paper shows, by means of some examples, the suitabil- 
ity and expressiveness of a pattern-based approach to integrate task and 
data parallelism. Coordination skeletons or patterns express task paral- 
lelism among a collection of data parallel HPF tasks. Patterns specify 
the interaction among domains involved in the application along with 
the processor and data layouts. On the one hand, the use of domains, 
i.e. regions together with some interaction information, improves pat- 
tern reusability. On the other hand, the knowledge at the coordination 
level of data distribution belonging to the different HPF tasks is the 
key for an efficient implementation of the communication among them. 
Besides that, our system implementation requires no change to the run- 
time system support of the HPF compiler used. We also present some 
experimental results that show the efficiency of the model. 



1 Introduction 

High Performance Fortran (HPF) o has emerged as a standard data parallel, 
high level programming language for parallel computing. However, a disadvan- 
tage of using a parallel language like HPF is that the user is constrained by 
the model of parallelism supported by the language. It is widely accepted that 
many important parallel applications cannot be efficiently implemented follow- 
ing a pure data-parallel paradigm: pipelines of data parallel tasks o , a common 
computation structure in image processing, signal processing or computer vision; 
multi-block codes containing irregularly structured regular meshes Q ; multidis- 
ciplinary optimization problems like aircraft design Q. For these applications, 
rather than having a single data-parallel program, it is more appropriate to sub- 
divide the whole computation into several data-parallel pieces, where these run 
concurrently and co-operate, thus exploiting task parallelism. 

Integration of task and data parallelism is currently an active area of re- 
search and several approaches have been proposed ^3^3 ^3- Integrating the 
two forms of parallelism cleanly and within a coherent programming model is 
difficult Q. In general, compiler-based approaches are limited in terms of the 
forms of task parallelism structures they can support, and runtime solutions 

* This work was supported by the Spanish project CICYT TIC-99-1083-C02-01. 

F. Mueller (Ed.): HIPS 2001, LNCS 2026, pp. IS-Q 2001. 

(c) Springer- Verlag Berlin Heidelberg 2001 
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require that the programmer have to manage task parallelism at a lower level 
than data parallelism. The use of coordination models and languages ^ and 
structured parallel programming D is proving to be a good alternative, pro- 
viding high level mechanisms and supporting different forms of task parallelism 
structures in a clear and elegant way 

In 01 we presented DIP (Domain Interaction Patterns), a new approach 
of integrating task and data parallelism using skeletons. DIP is a high level 
coordination language to express task parallelism among a collection of data 
parallel HPF tasks, which interact according to static and predictable patterns. 
It allows an application to be organized as a combination of common skeletons, 
such as multi-blocking or pipelining. Skeletons specify the interaction among 
domains involved in the application along with the mapping of processors and 
data distribution. 

On the one hand, the use of domains, which are regions together with some 
interaction information such as borders, make the language suitable for the so- 
lution of numerical problems, especially those with an irregular surface that can 
be decomposed into regular, block structured domains. In this paper, we prove 
how it can be successfully used on the solution of domain decomposition-based 
problems and multi-block codes. Moreover, we show how other kinds of prob- 
lems with a communication pattern based on (sub)arrays interchange (2-D FFT, 
Convolution, Narrowband Tracking Radar, etc.) may be defined and solved in 
an easy and clear way. The use of domains also avoids that some computational 
aspects involved in the application, such as data types, have to appear at the co- 
ordination level, as it occurs in other approaches This improves pattern 

reusability. 

On the other hand, the knowledge at the coordination level of data distribu- 
tion belonging to the different HPF tasks is the key for an efficient implemen- 
tation of the communication and synchronization among them. In DIP, unlike 
in other proposals O.B. the inter-task communication schedule is established 
at compilation time. Moreover, our approach requires no change to the runtime 
support of the HPF compiler used. In this paper, we also present some imple- 
mentation issues of a developed initial prototype and confirm the efficiency of 
the model by means of some experimental results. 

The rest of the paper is structured as follows. Section 1.1 discusses related 
work. Section 2 gives an overview of DIP. In section 3, the expressiveness and 
suitability of the model to integrate task and data parallelism is demonstrated 
by means of some examples. Section 4 discusses the implementation issues and 
preliminary results and, finally, in section 5, some conclusions are sketched. 



1.1 Related Work 

In recent years, several proposals have addressed integration of task and data 
parallelism. We shall state a few of them and discuss the relative contributions 
of our approach. 

The Fx model ca expresses task parallelism by providing declaration direc- 
tives to partition processors into subgroups and execution directives to assign 
computations to different subgroups (task regions). These task regions can be 
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dynamically nested. The new standard HPF 2.0 Q3 of the data parallel language 
HPF provides approved extensions for task parallelism, which allow nested task 
and data parallelism, following a similar model to that of Fx. These extensions 
allow the spawning of tasks but do not allow interaction like synchronization 
and communication between tasks during their execution and therefore might 
be too restrictive for certain application classes. Differently from these proposals, 
DIP does not need the adoption of new task parallel HPF constructs to express 
task parallelism. DIP is a coordination layer for HPF tasks which are separately 
compiled by an off-the-shelf HPF compiler that requires no change, while the 
task parallel coordination level is provided by the corresponding DIP library. 

In HPF/MPI ini, the message-passing library MPI has been added to HPF. 
This definition of an HPF binding for MPI attempts to resolve the ambiguities 
appeared when a communication interface for sequential languages is invoked 
from a parallel one. In an HPF/MPI program, each task constitutes an inde- 
pendent HPF program in which one logical thread of control operates on arrays 
distributed across a statically defined set of processors. At the same time, each 
task is also one logical process in an MPI computation. In our opinion, the adop- 
tion of a message-passing paradigm to directly express HPF task parallelism is 
too low-level. Moreover, in our approach, the inter-task communication schedule 
is established at compilation time from the information provided at the coor- 
dination level related to the inter-domain connections and data distribution. In 
this case, expressiveness and good performance are our relative contributions. 

Another coordination language for mixed task and data parallel programs 
has been proposed in 13 . The model provides a framework for the complete 
derivation process in which a specification program is transformed into a coor- 
dination program. The former expresses possible execution orders between mod- 
ules and describes the available degree of task parallelism. The latter describes 
how the available degree of parallelism is actually exploited for a specific par- 
allel implementation. The result is a complete description of a parallel program 
that can easily be translated into a message-passing program. This proposal is 
more a specification approach than a programming approach. The programmer 
is responsible for specifying the available task parallelism, but the final decision 
whether the available task parallelism will be exploited and how the processors 
should be partitioned into groups is taken by the compiler. Moreover, it is not 
based on HPF. The final message-passing program is expressed in C with MPI. 

Possibly the closest proposal to DIP is taskHPF Q. It is also a high level 
coordination language to define the interaction patterns among HPF tasks in 
a declarative way. Applications considered are also structured as ensembles of 
independent data parallel HPF modules, which interact according to static and 
predictable patterns. taskHPF provides a pipeline pattern and directives which 
help the programmer in balancing the pipelined stages: ON PROCESSORS direc- 
tive fixes the number of processors assigned to an HPF task and REPLICATE 
directive can be used to replicate non-scalable stages. Patterns can be composed 
together to build complex structures in a declarative way. Our approach has also 
a pipeline pattern with similar directives. However, the differences are substan- 
tial: a) we work with domains, without considering data types at coordination 
level, which can improve pattern reusability; b) our pattern do not force the 
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utilization of ending marks, such as END_OF_STREAM, in the computational part, 
i.e. inside an HPF task; c) our pattern provides information about the future 
data distribution together with the processor layout, which allows scheduling the 
inter-task communication pattern at compilation time. On the other hand, DIP 
provides a multi-block pattern that make the language suitable for the solution 
of domain decomposition-based problems and multi-block codes. 

The implementation of taskHPF is based on COLThpf o. a runtime sup- 
port specifically designed for the coordination of concurrent and communicating 
HPF tasks. It is implemented on top of MPI (there is a new version using PVM) 
and requires small changes to the runtime support of the HPF compiler used. 
DIP implementation is based on BCL a Border-based Coordination Lan- 
guage focused on the solution of numerical problems, especially those with an 
irregular surface that can be decomposed into regular, block structured domains. 
BCL is also implemented on top of the MPI communication layer, but no change 
to the HPF compiler has been needed. 

Finally, it is worthy of remark some others skeletal coordination approaches 
with goals quite different from those of DIP. In Q, Activity Graphs are defined 
to provide an intermediate layer for the process of skeletal program compilation, 
serving as a common, language independent target notation for the translation 
from purely skeletal code, and as the source notation for the specific phase of 
base language code generation. In the former role they also provide a precise 
operational semantics for the skeletal layer. In [3, it is described the way the 
Network Of Tasks model ca is used to built programs. This model is a extremely 
powerful program composition technique which is both semantically clean and 
transparent about performance. Software developers can reliably predict the 
performance of their programs from the knowledge of the the performance of 
the component nodes and the visible graph structure. 

2 The DIP Coordination Language 

In this section we only give a brief summary of the DIP language. More detailed 
description of DIP appears in 

DIP is a high level coordination language which allows the definition of a 
network of cooperating HPF tasks, where each task is assigned to a disjoint set 
of processors. Tasks interact according to static and predictable patterns and 
can be composed using predefined structures, called patterns or skeletons, in 
a declarative way. Besides defining the interaction among tasks, patterns also 
specify processor and data layouts. DIP is based on the use of domains, i.e. 
regions together with some interaction information that will allow efficient inter- 
task coordination. 

We have initially established two patterns in DIP. The MULTIBLDCK pattern 
is focused on the solution of multi-block and domain decomposition-based prob- 
lems, which conform an important kind of problems in the high performance 
computing area. This pattern specifies the different blocks or domains that form 
the problem and also establishes the coordination scheme among tasks. For the 
latter role, it defines the borders among domains and establishes the way these 
borders will be updated. 
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The other skeleton provided by DIP is the PIPE pattern, which pipelines 
sequences of tasks in a primitive way. It is also based on the use of domains, 
which avoids that computational aspects such as data types have to appear at the 
coordination level, improving pattern reusability. In this case, no border among 
domains has to be explicitly specified, since all data associated to a domain are 
involved in the interaction. Nested pipeline patterns are allowed so that complex 
structures can be built in a declarative way. 

HPF tasks receive the domains they need and use them to establish the 
necessary variables for computation. Local computations are achieved by means 
of HPF sentences while the communication and synchronization among tasks 
are carried out through some incorporated DIP primitives (PUT_DATA, GET_DATA, 
converge), a new type (DDMAINxD) and a new attribute (GRIDxD) have also been 
included. 

3 Two Simple Examples 

3.1 Example 1. Laplace’s Equation 

The following program shows the MULTIBLOCK pattern for an irregular problem 
that solves Laplace’s equation in two dimensions using Jacobi’s finite differences 
method with 5 points. 

Au = 0 in n (1) 

where rt is a real function, 12 is the domain, a subset of , and Dirichlet bound- 
ary conditions have been specified on 912, the boundary of 12: 

u = g in 912 (2) 



MULTIBLOCK Jacobi u/1 , 1 ,Nxu,Nyu/ , v/1 , 1 ,Nxv,Nyv/ 
solve (u : (BLOCK , BLOCK) ) ON PROCS (4,4) 
solve (v: (BLOCK, BLOCK)) ON PR0CS(2,2) 

WITH BORDERS 

u(Nxu,Nyl,Nxu,Ny2) <- v(2,l,2,Nyv) 
v(l,l,l,Nyv) <- u(Nxu-l,Nyl,Nxu-l,Ny2) 

END 

The domains in which the problem is divided are shown in Figure ntogether 
with a possible data distribution and the border between domains. Dotted lines 
represent the distribution into each HPF task. A domain definition is achieved 
by means of an assignment of Cartesian points, i.e. the region of the domain 
is established. For example, the expression u/1 , 1 ,Nxu,Nyu/ assigns to the do- 
main u the region of the plane that extends from the point (1,1) to the point 
(Nxu,Nyu) . A border is specified by means of the <- operator. For example, the 
expression u(Nxu,Nyl ,Nxu,Ny2) <- v(2,l,2,Nyv) indicates that the zone of 
u delimited by points (Nxu,Nyl) and (Nxu,Ny2) will be updated by the values 
belonging to the zone of v delimited by points (2,1) and (2,Nyv). 

In the task call specification, the name of the domain (say u) to be solved 
by the task (solve) and the data distribution (for example (BLOCK, BLOCK) ) are 
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Fig. 1. Communication between two HPF tasks. 



specified. The processor layout is also indicated (for example ON PR0CS(4,4)). 
The distribution types correspond to those of HPF. This declaration does not 
perform any data distribution but indicates the future distribution of data asso- 
ciated to the specified domain. The knowledge at the coordination level of data 
distribution is the key for an efficient implementation of the communication 
among HPF tasks. A task knows the distribution of its domain and the distri- 
bution of every domain with a border in common with its domain by means of 
the information declared in the pattern. So, it can be deduced which part of the 
border needs to be sent to which processor of other task. This is achieved at 
compilation time. 

Finally, the code of subroutine solve used for the two tasks specified in 
the pattern is shown below. Line 2 declares a domain variable for the received 
domain. Line 3 uses the attribute GRID to declare two record variables g and 
g_old. After domain and grid declarations, line 4 is a special kind of distribution 
which produces the distribution of the field DATA (an array of real numbers) 
and the replication of the field DOMAIN (the associated domain) of both g and 
g_old variables. In lines 5 and 6, arrays g°/oDATA and g_old7,DATA are dynamically 
created at the same time that the domain d is assigned to the fields g°/oDOMAIN 
and g_old7D0MAIN, respectively. The initialization of g7DATA is performed in 
the subroutine called in line 7. Statement 9 produces the assignment of the 
two variables with GRID attribute. Since g_old has its domain already defined, 
this instruction will just produce a copy of the values of the field g°/oDATA to 
g_oId"/.DATA. 

Lines 10 and 11 are the first where communication is achieved. Data from 
g7oDATA needed by each task are communicated. Local computation is accom- 
plished by the subroutines called in lines 12 and 13 while the convergence is 
tested in line 14. The instruction CONVERGE causes a communication between 
the two tasks. In this case, the communicated data is the value of the variable 
error. The maximum value (calculated by function maxim) obtained in each 
process is assigned to the variable error once the execution of CONVERGE is 
finished. 
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1) subroutine solve (d) 

2) DQMAIN2D d 

3) double precision, GRID2D : ; g,g_old 

4) !hpf$ distribute (BLOCK, BLOCK) : :g,g_old 

5) g”/D0MAIN = d 

6) g_old7.D0MAIN = d 

7) call initGrid (g) 

8) do i=l, niters 

9) g_old = g 

10) PUT_DATA (g) 

11) GET_DATA (g) 

12) call computeLocal (g, g_old) 

13) error = computeNorm (g, g_old) 

14) CONVERGE (g, error, maocim) 

15) Print *, "Max norm: ", error 

16) enddo 

17) end subroutine solve 

3.2 Example 2. 2-D Fast Fourier Transform 

2-D FFT transform is probably the application most widely used to demonstrate 
the usefulness of exploiting a mixture of both task and data parallelism EIE- 
Given an NxN array of complex values, a 2-D FFT entails performing N inde- 
pendent 1-D FFTs on the columns of the input array, followed by N independent 
1-D FFTs on its rows. In order to increase the solution performance and scala- 
bility, a pipeline solution scheme is preferred as proved in m and Q. Figure Q 
shows the array distributions needed for that scheme. 




Fig. 2. Array distributions for 2-D FFT. 



This mixed task and data parallelism scheme can be easily codified using DIP. 
The following code shows the PIPE pattern. A domain d/l,l,N,N/ is defined 
for representing the application array at the coordination level. Again, data 
distribution and processor layout are indicated in the task call specification. 

PIPE FFT2D 

cfft(d/l,l,N,N/: (*, BLOCK)) ON PR0CS(4) 
rfftCd: (BLOCK,*)) ON PR0CS(4) 

END 
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The code below shows the two stages. The stage cf ft reads an input element, 
performs the 1-D transformations and calls PUT_DATA(a). The stage rfft calls 
GETJDATA(b) to receive the array, performs the 1-D transformations and write 
the result. The communication schedule is known by both tasks, so that a point 
to point communication between the different HPF processors can be carried 
out. 

1) subroutine cfft (d) 

2) D0MAIN2D d 

3) complex, GRID2D :: a 

4) !hpf$ distribute a(*, block) 

5) a7.D0MAlN= d 

6) do i= 1, n_images 

7) call read_stream (a°/oDATA) ! read input 

8) !hpf$ independent 

9) do icol = 1, N 

10) call fftSlice(a7„DATA( : , icol) ) 

11) enddo 

12) PUT_DATA (a) 

13) enddo 

14) end 

1) subroutine rfft (d) 

2) D0MA1N2D d 

3) complex, GR1D2D :: b 

4) !hpf$ distribute bCblock,*) 

5) b7.D0MAlN= d 

6) do i= 1, n_images 

7) GET_DATA (b) 

8) !hpf$ independent 

9) do irow = 1, N 

10) call fftSlice(b7oDATA(irow, : ) ) 

11) enddo 

12) call write_stream (b7oDATA) ! write output 

13) enddo 

14) end 

An alternative solution that shows how nested PIPE patterns are used is given 
in the following code. 

PIPE FFT2D INOUT d/l,l,N,N/ 
cfftCd: (*, BLOCK)) ON PRDCS(4) 
rfftCd: (BLOCK,*)) ON PROGS (4) 

END 

PIPE Alternative_Solution 

lnput(d/l,l,N,N/: (*, BLOCK)) ON PROCS(l) 

FFT2D(d) 

Output (d: (BLOCK,*)) ON PROCS(l) 

END 
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A list of input/output domain definitions must appear after the nested pat- 
tern name. One of the predefined word IN, OUT, INDUT must precede each domain 
definition. Here, the domain involved in the pattern FFT2D is defined as an in- 
put/output domain. This pattern is ’’called” in the second stage of the main 
PIPE pattern. Subroutines cfft and rfft shown above must be modified since 
read/write operations are now carried out by Input and Output stages. So, line 
7 in subroutine cfft is substituted by a call to GETJDATA(a), and line 12 in 
subroutine rfft is now a call to PUT_DATA(b). 

Although the first solution is shorter, the second one establishes a more useful 
PIPE FFT2D pattern, since it can be reused in more complex applications, such as 
Convolution, which is a standard technique used to extract feature information 
from images. It involves two 2-D FFTs, an elementwise multiplication, and an 
inverse 2-D FFT and it is applied to two streams of input images to generate a 
single output stream. 

4 Implementation Issues and Results 

In order to evaluate the performance of DIP, a prototype has been developed. 
Several examples have been used to test it and the obtained preliminary results 
have successfully proved the efficiency of the proposal. Here, we show the results 
for Jacobi’s method and the 2-D FFT problem explained above. 

For designing our initial prototype, we have built a compiler that translates 
our DIP code to BCL code. The implementation is based on source-to-source 
transformations together with the necessary libraries and it has been realized on 
top of the MPI communication layer and the public domain HPF compilation 
system ADAPTOR [^. No change to the HPF compiler has been needed. In a 
BCL program there are one coordinator process and one or several worker pro- 
cesses. The coordinator process is in charge of establishing all the coordination 
aspects and creating the worker processes. A worker process is the computational 
task. So, in our DIP to BCL transformation phase, we have created the coordi- 
nator process from the coordination pattern and the worker processes from our 
computational tasks. 

A cluster of 4 nodes DEC AlphaServer 4100 interconnected by means of 
Memory Channel has been used. Each node has 4 processors Alpha 22164 (300 
MHz) sharing a 256 MB RAM memory. The operating system is Digital Unix 
V4.0D (Rev. 878). 

Table O reports the ratio between the HPF and DIP execution times for Ja- 
cobi’s method for the different domains of the problem and numbers of processors 
exploited. We have considered 2, 4 and 8 domains with a 128 x 128 grid each 
one. The program has been executed for 20000 iterations. Table 1 highlights the 
better performance of the mixed task and data-parallel implementation. When 
the number of processors is equal to the number of domains (only task paral- 
lelism is achieved) DIP has also shown better results. Only when there are more 
domains than available processors, DIP has shown less performance because of 
the context change overhead among weight processes. 

TableQreports the HPF /DIP ratio for different problem sizes of the 2-D FFT 
application. Again, the performance of DIP is generally better. However, HPF 
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Table 1. HPF/DIP ratio for Jacobi’s method. 



Domains 


HPF/DIP ratio 


4 Procs. 


8 Procs. 


16 Procs. 


2 


1.03 


1.27 


1.49 


4 


1.04 


1.57 


2.38 


8 


0.93 


1.57 


2.90 



Table 2. HPF/DIP ratio for 2-D FFT. 



Size 


HPF/DIP ratio 


4 Procs. 


8 Procs. 


16 Procs. 


32 by 32 


1.59 


2.08 


1.47 


64 by 64 


1.09 


1.44 


1.83 


128 by 128 


1.03 


1.08 


1.25 



performance is near DIP as the problem size becomes larger and the number of 
processors decreases, as it also happens in other approaches m . In this situation, 
HPF performance is quite good and so, the integration of task parallelism does 
not contribute so much. 

5 Conclusions 

We have used DIP, a Domain Interaction Pattern-based high level coordination 
language, to integrate task and data parallelism. The suitability and expres- 
siveness of the model have been proved by means of some examples. We have 
also confirm the efficiency of the approach discussing some experimental results 
obtained with an initial prototype. The main advantage of this approach is to 
supply programmers with a concise, pattern-based, high level declarative way to 
describe the interaction of their HPF tasks. By means of predefined skeletons, 
the programmer can express task parallelism among a collection of data parallel 
HPF tasks, so that task and data parallelism integration is achieved. The use of 
domains and the establishment of data and processor layouts at the coordination 
level allow pattern reusability and efficient implementations, respectively. 
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Abstract. One of the major challenges facing high performance computing is 
the daunting task of producing programs that will achieve acceptable levels of 
performance when run on parallel architectures. Although many organizations 
have been actively working in this area for some time, many programs have yet 
to be parallelized. Furthermore, some programs that were parallelized were 
done so for obsolete systems. These programs may run poorly, if at all, on the 
current generation of parallel computers. Therefore, a straightforward approach 
to parallelizing vectorizable codes is needed without introducing any changes to 
the algorithm or the convergence properties of the codes. Using the 
combination of loop-level parallelism, and RISC-based shared memory SMPs 
has proven to be a successful approach to solving this problem. 

Keywords: Parallel programming, high performance computer, super 

computer, loop-level parallelism. 



1 Introduction 

One of the major challenges facing high performance computing is the daunting 
task of producing programs that will achieve acceptable levels of performance when 
run on parallel architectures.' In order to meet this challenge, the program must 
simultaneously achieve three goals: 

1) Achieve a reasonable level of parallel speedup at an acceptable cost. 

2) Demonstrate an acceptable level of serial performance so that moderate sized 
problems do not require enormous levels of resources. 

3) Use an algorithm with a high enough level of algorithmic efficiency that the 
problem remains tractable. 
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Even though many organizations have been actively working in this area for 
5 10 y ears (or longer), many programs have yet to be parallelized. Furthermore, some 
programs that were parallelized, were done so for what are now obsolete systems 
(e.g., SIMD computers from Thinking Machines and MASPAR), and these programs 
run poorly, if at all, on the current generation of parallel computers. There has also 
been a problem that some approaches to parallelization can subtly change the 
algorithm and result in convergence problems when using large numbers of 
processors [1]. This is a common problem, particularly when using domain 
decomposition w ith i mplicit CFD^ codes. There are algorithmic solutions to this 
problem (e.g., multigrid codes or the use of a preconditioner); however, many of these 
solutions have problems in their own right (e.g., poor scalability). 

At the other end of the spectrum, there are those who champion automatic 
parallelization. They expect to soon be able to parallelize production codes for 
efficient execution on modern production hardware. Unfortunately, as a general rule, 
this has not happened. 

From discussions such as this, talks with numerous researchers, and the authors’ 
research at the U.S. Army Research Laboratory, the following can be concluded: 
Writing parallel programs is a challenge. 

Writing efficient serial programs on today’s RBC and CISC processors with 
their memory hierarchies (i.e., cache) is a challenge. 

Requiring the program to show near-linear scalability out to 
hundreds/thousands of processors greatly complicates matters. 

Requiring the program to show portable performance across all or most 
modem parallel architectures greatly complicates matters. 

Modern processors are fast enough that for many problems that have 
traditionally been considered to be the sole domain of supercomputers, they 
may now be solvable using a moderate sized system (e.g., 10 100 GFLOPS of 
peak processing power) given a sufficiently efficient algorithm and 
implementation. 

Therefore, a straightforward approach to the parallelization of one or more 
important classes of codes is needed. This approach will meet the following 
requirements: 

It will work with a class of machines that has more than one member in it, but 
it need not include the entire universe of parallel computers. 

It will not require an unreasonable amount of effort. 

The results achieved by using this approach must satisfy the needs of the user 
community. 

The combined hardware and software costs must be acceptable. 

At least for small- to moderate-sized problems, it must be possible to complete 
the project before the equipment is obsolete. 

The remainder of this paper will begin by discussing an approach developed at the 
U.S. Army Research Laboratory that was designed to meet all of these requirements 
for a large and important class of codes. Following this, the results of applying this 
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approach to a specific production code will be discussed. It will conclude by 
considering some issues in greater detail and with a discussion of related work. 



2 Class of Codes 

For this project, the class of codes that was selected was the class of vectorizable 
codes. Of particular interest were those vectorizable codes that were considered to be 
nonparallelizable. A representative code was selected (F3D, an implicit CFD code) 
[2]. The following factors make this class of codes particularly interesting: 

From the mid-1970s to the mid-1990s, the terms vector computers and 
supercomputers were nearly synonymous (e.g., Cray C90). 

Many of the traditional vectorizable algorithms are known to be 
computationally more efficient than the algorithms most frequently associated 
with parallel computing. 

If one can efficiently tune one of these jobs to run on a parallel computer, then 
any job that exhibits an acceptable level of performance when using one 
processor of a C90 should exhibit an acceptable level of performance when 
using a modest number of RISC processors. 

Problem sizes that are ten times greater (that is ten times overall, not ten times 
in each direction) are likely to exhibit an acceptable level of performance when 
using a moderate-sized system (e.g., 10 100 GFLOPS of peak processing 
power). 

In recent years, SGI, SUN, Convex/HP, DEC/Compaq, and IBM have 
produced SMPs with this level of performance. 

It is our belief and experience that many of these systems are extremely well 
suited for running programs that have been parallelized using loop-level 
parallelism (e.g ., OpenMP). 

Since vectorization is a form of loop-level parallelism, there is reason to hope 
that many vectorizable programs can be parallelized using this technique. 
While it is not always clear that this will be the best approach to parallelizing 
these programs, if one focuses on programs that are hard to parallelize, that 
objection should be effectively eliminated. 

Unfortunately, this is not the end of the story. It would be nice if things were this 
simple, but they are not. Three important obstacles had to be conquered before this 
project could get off of the ground: 

1) Sufficiently powerful SMPs had to come onto the market. At the start of this 
project, no one had yet produced a RISC-based SMP with a peak speed of 10 
GFLOPS, let alone 100 GFLOPS. 

2) All RISC processors use caches, and these processors were generally 
considered to be poorly suited to the needs of running scientific codes [3]. 

3) Vectorization is normally applied to the inner-most loop of a loop nest. 
However, as will be seen in section three, when using OpenMP and similar 
techniques, one is well advised to parallelize outer (or at least middle) loops. In 
some cases, this will necessitate the interchanging of loops in the loop nest. It 
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may also be desirable to perform other transformations such as combining 
loops under a common outer loop. 



3 Symmetric Multiprocessors 

There are primarily four types of parallel computers in use today: 

1) Distributed memory systems. These are also sometimes referred to as s hared 
nothing. T hey are almost always programmed using a message-passing library 
(in recent years, Aff’/has become the standard library). 

2) Globally addressable memory. Cray Research frequently referred to the Cray 
T3D and T3E as shared memory architectures. A more appropriate description 
is globally addressable memory. While it is true that each processor can access 
all of the memory associated with a job, there are some important differences 
between these systems and true shared memory: 

a) The normal load and store instructions that are normally used to access the 
local memory cannot be used to access the remote memory. 

b) While it is theoretically possible to use the same instructions to perform 
loads and stores involving both local and remote memory, these 
instructions are not cache coherent. 

c) The introduction of the synchronization events needed to replace hardware 
coherency with software coherency can significantly interfere with the 
performance of the code. Since this requires replacing an automatic 
function of the hardware with a manual function of the code writer, it can 
greatly complicate efforts to produce valid code. 

3) Master slave. There have been a wide range of these systems produced in the 
past, including many of the SIMD-hased systems. For the purpose of this 
paper, we will only discuss shared memory systems using this organization. 
Their advantage was that this organization made it easy to port a uniprocessor 
operating system to a parallel computer. Their disadvantage was that their 
performance scaled very poorly. As a result, few such systems are still being 
produced. 

4) Symmetric multiprocessor. While more than one type of system might conform 
to this title, this paper will only use it to refer to shared memory systems in 
which most, if not all, critical sections of the operating system can run on any 
of the processors (this is where the term symmetric comes from) [4]. 
Furthermore, if the processors have one or more levels of cache (as all RISC- 
based systems do, but which most vector based systems lack), the system is 
cache coherent (otherwise this paper would refer to it as globally addressable). 

Now consider the constraints that the choice of hardware places on the effort to 
efficiently parallelize a program. When using loop-level parallelism based on the use 
of compiler directives, there is no need to explicitly generate messages. The data will 
flow between processors and main memory (in some cases, it will even flow directly 
between two processors) as needed (for the time being, questions of efficiency in a 
NUMA or COMA architecture will be ignored; section 7 will discuss these questions). 
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Therefore, the main cost of parallelization will be assumed to be the synchronization 
cost associated with exiting a parallel section of code. On different machines and load 
factors, the synchronization cost (for scalable systems) ranges from 2,000 to 
1 million cycles (or more). From the standpoint of efficiency, one would like to keep 
these costs below 1% of the runtime. Table 1 shows how many cycles of work must 
be associated with the loop when run on a single processor in order to achieve this 
goal. It is important to keep in mind that the synchronization cost is highly dependent 
on the system load and the design of the memory system but is almost independent of 
the design of the processor. Therefore, as the memory latency (when expressed in 
cycles) for a random memory location continues to increase, the synchronization costs 
would also be expected to increase. 



Table 1. The minimum amount of work (in cycles) per parallelized loop required for 
efficient execution. 



Hypothetical synchronization cost (in cycles) 

Number 



of processors used 


10,000 


100,000 


1,000,000 


2 


2,000,000 


20,000,000 


200,000,000 


8 


8,000,000 


80,000,000 


800,000,000 


32 


32,000,000 


320,000,000 


3,200,000,000 


128 


128,000,000 


1,280,000,000 


12,800,000,000 



Two conclusions can be reached from Table 1: 

1) It is imperative to minimize the synchronization costs. 

2) Every effort should be made to maximize the amount of work per 
synchronization event (see Table 2). 

Table 2 clearly demonstrates the advantage of parallelizing primarily outer loops. 
It also demonstrates the difficulty associated with the efficient parallelization of 
boundary condition routines. As such, it is frequently desirable to leave such routines 
unparallelized. However, for larger numbers of processors, this may result in 
problems with Amdahl’s Law (too much time spent executing serial code). 



4 The Approach 

Since this approach is predicated on the assumption that one can achieve the desired 
level of performance when using modest-to-moderate numbers of processors, the first 
step is to improve the serial efficiency of the code. As was previously stated, the 
general consensus was that this would be difficult, possibly even impossible, to 
achieve. However, our experience showed that this need not be the case, which is not 
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Table 2. The available amount of work (in cycles) per synchronization event for a 
I million grid point zone. 



Problem type Grid dimension 


No. of 

loop iterations 


Work p 

10 


er grid point (in cycles) 
100 1000 


l-D 


1,000,000 


1,000,000 


10,000,000 


100,000,000 


1,000,000,000 


2-D 


lOOOWOOO 


1,000 










Inner loop 




10,000 


100,000 


1,000,000 




Outer loop 




10,000,000 


100,000,000 


1,000,000,000 




Boundary condition 




10,000 


100,000 


1,000,000,000 


3-D 


loomoomoo 


100 










Inner loop 




1,000 


10,000 


100,000 




Middle loop 




100,000 


1,000,000 


100,000,000 




Outer loop 




10,000,000 


100,000,000 


1,000,000,000 




Boundary condition-inner 


loop 


1,000 


10,000 


100,000 




Boundary condition-outer loop 


100,000 


1,000,000 


100,000,000 



to say that the effort could be completed in a few days. There were four main 
concepts used in this part of the effort: 

1) Use a lar^ memory SMP, which was a key enabling factor for this part of the 
effort. It is easier to perform serial tuning when working with serial code. 

2) Use traditional techniques such as reordering of loops and/or array indices, 
blocking, and matrix transpose operations to increase the locality of reference. 

3) Reorder the work so that rather than streaming data into and out of the 
processor, the code would stress maximizing the amount of work per cache 
miss.^ 

4) Adjust the size of scratch arrays so that they can be locked into cache. In 
particular, there were two key loops in F3D that had dependencies in two out 
of three directions. In order to vectorize these loops, the original programmers 
had to process data one plane at a time. This meant that the size of the scratch 
arrays were proportional to the size of a plane of data. Clearly for large 
problems, these scratch arra)s were unlikely to fit into even the lar^st caches. 
However, since RISC processors do not rely on vectorization to achieve high 
levels of performance, it was possible to resize these arrays to hold just a single 
row or column of a single plane of data. The arrays now comfortably fit in a 
1 MB cache for zone dimensions ranging up to about 1000. 



^ In this respect, implicit CFD codes have a definite advantage over explicit CFD codes, since they do more 
work per time step. 
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Therefore, one can see that while SMPs suffer fron a largpr memory latency 
(relative to the memory latency of most woihstations and MPPs, the presence of lar^ 
caches can more than mate up for this limitation. 

Once most of the serial tunii^ had been completed, it was possible to parallelize 
the code. Here one of the strengths of loop-level parallelism really shined. With both 
HPF and traditional message passing code, one is generally left with ai all or nothing 
proposition. In othre words, eitho- all of the code must be parallelized, or the program 
will not run at all. When using loop-level parallelism in conjunction with an SMP, 
this is definitely not the case. Therefore, it is possible to use profiling to find the 
experrsive loops and that to parallelize them one (or a few) at a time. This allows one 
to alterrrate between parallelization and debuggirtg, which in most cases will greatly 
simplify the debuggir^ process. It will also allow one to better judge the effectiveness 
of what they have tried. In many cases, this will greatly simplify the task of 
paralelizafion compared to all-or-rrothir^ ai^roaches such as HPF or usit^ message 
passing libraries (e.g., MPI). This is not a unique observation, a sirrrilar observation 
was rtrade by Frurrrkin et al. [5], as well as othas. 

As previously mentioned, vectorization is a form of loop-level parallelism. 
Therefore, in theory one should be aMe to trse OperrMP to parallelize vectorizable 
code. In pracice, there are 6rrr additiorral steps that need to be taken if one is to 
achieve higjr levels of performance: 

1) Since vectorization deals with inna- loops, it is generally desirable to 
paralelize the outer loops, even though they are rot involved in the 
vectorization of the program (see example I). 

2) It is frequmlly desirable to merge loops togetha- (see example 2). It should be 
noted thd in some cases the merging of loops for purposes of parallelization 
can be comtrined with the serial hrrrir^ techrrique of code blocking. 

3) In some cases one can sigrrificarrtly improve perfwmaree by parallelizirrg a 
loop in a parent subroutine (in some cases, one has to first create such a loop) 
(see example 3). It should benoted that in general this optirrization will reduce 
the nurrlrer of synchrorrization events by I 3 or ders of rrragnitudd 

4) One has to be verycarefrl when dealir^ with arrajs that are park of a commcm 
block. Whoi possible, it is desirable to nrove the arrays out of the corrrmon 
blocks (in some cases, this has to be done during the serial tuningas a first step 
in the resizirrg of those array). 

Two thirrgs are inporlant to rememter: 

1) The more jrocessors that are tsed, the harder it is to jrrstify the overhead 
associated with the parallelization of bounday concfition subnutines (as well 
as other inexpensive subroutines). 

2) The more time spent in serial code, the harder it is to show benefit from using 
larger (e.g., 50+) nurrirers of processors. 

Therefore, one is left with the problem of choosing between the lesser of two 
evils: 

I) As the number rf processors increases, the speed first peaks and then stark to 
drop off (assuming that the synchronization costs are a function of the nurrlrer 
of processors being used). 
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C$doacross local (L,J,K) 

DO 10 L=1,LMAX 

DO 10 K=1,KMAX 

DO 10 J=1,JMAX 

Some computation with no dependencies in any direction 

10 CONTINUE 

Example 1. Parallelizing an outer loop. 



C$doacross local (L,J,K) 

DO 20 L=1,LMAX 

DO 10 K=1,KMAX 

DO 10 J=1,JMAX 

Body of the first loop 
10 CONTINUE 

DO 20 K=1,KMAX 

DO 20 J=1,JMAX 

Body of the second loop 

20 CONTINUE 

Example 2. Merging loops to reduce synchronization costs. 



DO 10 J=1,JMAX 

C NOTE: We will assume that there are dependencies in the J direction 

C that inhibit parallelization. 

CALL SUBA (...) 

C SUBA batches up a 2 -dimensional buffer for processing by SUBB. 

CALL SUBB (...) 

C SUBB has a dependency in one direction, which requires processing a 
C 2 -dimensional buffer if the code is to be vectorizable . 

10 CONTINUE 

(a) What the original code looked like. 



C$doacross local (L,J) 

DO 10 L=1,LMAX 

DO 10 J=1,JMAX 
CALL SUBA (...) 

C SUBA now batches up a 1 -dimensional buffer that easily fits in a 
C large cache, for processing by SUBB. 

CALL SUBB (...) 

C SUBB has a dependency in one direction, but since we are no longer 
C dealing with vector processors, this does not matter. Therefore SUBB 
C can safely process the much smaller 1-dimensional buffer. 

10 CONTINUE 

(b) The cache optimized parallelized code. 



Example 3. Parallelizing a parent subroutine. 
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2) As the number of processors increases, the speed approaches an asymptotic 
limit. 

The net effect of this is that, past a certain point, it is hard to show good speedup 
when using loop-level parallelism. While the authors are familiar with a number of 
researchers who felt that this limit would be around 4 16 processors, it is our 
experience that one should be able to efficiently use 30 128 processors (depending on 
the problem and the problem size). 

Another related problem is that most people are used to thinking in terms of 
problems that have a nearly infinite level of parallelism. In such cases, the ideal 
speedup is linear. However, with loop-level parallelism, one is frequently dealing 
with parallelizing loops that have between 10 and 1000 iterations. This means that the 
available parallelism is in the range of 10 100 0. When the number of processors is 
within roughly a factor of 10 of the available parallelism, the ideal speedup is no 
longer linear. Instead the curve shows a distinctly stair-step shape. Table 3 shows 
where this shape comes from. 

Table 3. Predicted speedup for a loop with 15 units of parallelism. 



Number of 
processors 


Maximum units of 
parallelism 
assigned to a single 
processor 


Predicted speed up 


1 


15 


1.000 


2 


8 


1.875 


3 


5 


3.000 


4 


4 


3.750 


5 7 


3 


5.000 


8 14 


2 


7.500 


15 


1 


15.000 



5 Results 

We have been able to do serial runs on an SGI Origin 2000 for problem sizes ranging 
from 1 200 million grid points without a significant decrease in the MFLOPS rate. 
This is the exact opposite of what was expected based on the literature at the time [3]. 
Furthermore, serial tuning on the SGI Power Challenge resulted in a speedup of more 
than a factor of 10. Attempts were made to measure the speedup on a Convex 
Exemplar SPP-1000. However, even though a 3 million grid point problem was being 
used, the vector version of the code was running so slowly that the job was killed 
before it had completed 10 time steps (the way things were going, this would 
probably have taken the better part of a day or more). The serial-tuned code 
completed 10 time steps in 70 min. 
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An interesting outcome from the parallelization of this code is that for larger 
numbers of processors, the performance as a function of the number of processors 
used is far from linear. Instead, the curve can best be described as a stair-step. (See 
section 4 for an explanation of this effect.) When using loop-level parallelism with a 
3-dimensional code, there can be dependencies in one or more directions for key 
loops. With a maximum loop dimension of M, the available parallelism is roughly M. 
Therefore, one can expect to see jumps in performance at M/5, M/4, M/3, M/2, and M 
processors. This effect is demonstrated in Table 3 and can be seen in the results in 
Table 4 (e.g., the nearly flat performance between 48 and 64 processors for the 
1 million grid point test case and between 88 and 104 processors for the 59 million 
grid point test case). 

Table 4 shows some representative results for the R12000 based SGI Origin 2000 
(128 processors, 300 MHz) and the UltraSPARC II based SUN HPC 10000 
(64 processors, 400 MHz).'* In reporting these results, there is the question of what is 
the best metric to use. We prefer to avoid speedup, since it fails to describe the actual 
performance of the code and can actually favor poorly performing codes (the lower 
the serial performance, the easier it is to show good speedup). From the user s 
perspective, what really matters are metrics such as run time and turn around time. 
However, as the number of processors approach the available parallelism, the 
predicted run time should asymptotically approach some low value, making it 
difficult to evaluate the performance of the code. Our preferred metric is time 
steps/hour, since it allows the user to easily estimate what the run time should be 
(not counting start up and termination costs, which have been eliminated from our 
results). This metric also has the advantage that for problems with large amounts of 
parallelism, it gives the linear performance curve that one normally equates with 
parallel programs. Another useful metric which has been included in Table 4 is the 
delivered MFLOPS, which allows one to determine not only the parallel efficiency, 
but also the serial efficiency of our implementation. The peak speed of a processor on 
the SUN system is 800 MFLOPS and 600 MFLOPS on the SGI system. From the 
results, one can see that the per processor delivered performance of the two systems is 
actually very similar. This is probably the result of the two vendors taking different 
approaches in designing their chips. Some vendors prefer to make the fastest chips 
possible, even though they have a lot of hazards that can limit their delivered 
performance, while other vendors prefer to make somewhat slower but friendlier 
chips. Both approaches are valid and can result in a good product. But as our results 
demonstrate, it is important to compare products based on their delivered 
performance, not their peak performance. 



*On the SGI Origin the CSdoacross directives were used. On the SUN HPC 10000, the SUN specific PCF 
directives were used, since at the time SUN did not support either SGI’s CSdoacross directives or the 
OpenMP directives. 
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Table 4. Measured perfcrmance of the RISC optimized shared memcry version 
ofF3D. 



Number of 
processors 
used 


Problem size 
(million grid 
points) 


SUNHPC 

10000 

Time steps/hr 


Perfirmance 

SGIR12000 
Origin 2000 

MFLOPS Time steps/hr MFLOPS 


1 


c 


138 


1.80E2 


181 


2.37E2 


32 


1 


2786 


3.64E3 


2877 


3.76E3 


48 


1 


3093 


4.04E3 


3545 


3.63E3 


64 


1 


2819 


3.69E3 


3694 


4.83E3 


72 


1 


N/A 


N/A 


4105 


5.37E3 


88 


1 


N/A 


N/A 


5087 


6.65E3 


1 


so*" 


2.3 


1.79E2 


2.3 


1.79E3 


32 


59 


51 


3.97E3 


59 


4.59E3 


48 


59 


74 


5.76E3 


73 


5.68E3 


64 


59 


83 


6.56E3 


91 


7.08E3 


72 


59 


N/A 


N/A 


101 


7.86E3 


88 


59 


N/A 


N/A 


128 


9.96E3 


104 


59 


N/A 


N/A 


131 


1.02EA 


112 


59 


N/A 


N/A 


144 


1.12EA 


120 


59 


N/A 


N/A 


150 


1.17EA 


124 


59 


N/A 


N/A 


153 


1.19BA 



’’The 1 million grid point test case consists of three zones with dimensions of 15H75H70, 



87H75H70, and 89H75H70. 

*’Xhe 59 million grid point test case consists of dime zones with dimensions of 29H450H350, 
173H450H350, and 175H450H350. 



6 Tools 

In paforming the serial tuning, the pnndple tools used were various pioilling tools. 
Initially, this meant using prof with and withoit pixie. Without pixie, prof would 
measure the actual ran time for the individual suhroutines. With pixie, prof would 
measure the theoretical rui time for ttie suhroutines, assuming an inilnitely fast 
memoiy system. By suhtiacting those two sets of nunljers, one can thoi estimate the 
cost of cache and TLB misses. Fortunately, most of tod^ s systems come with tools 
that allow one to directly measure those values (on systems lacking those tools, codes 
can he instrumented using PAPI, which is being developed at the University of 
Tennessee, Knoxville, TN, as a project for the PTOOLS organization). 

After tuningfor cache and TIB misses, a few loops hada low cache/TIB miss 
rate hut \rere still expensive to mn. Additiond hand optinizations guided hy 
assembly code dumps were performed to achieve a higher data reuse rate at the 
register level while eliminating unnecessary register spilling, pipeline stalls, and low 
instruction issue rates from excessive numbers of loads and stores. A key aspect of 
this phase of the tuning was to rm the program m as wide a rarge of RISC-based 
systems as possible to better undostand which optimizations would be of universal 
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value. Table 5 contains a list of systems used in this effort. Using this wide range of 
systems and compilers allowed tuning for a wider range of TLB and cache sizes. It 
also allowed us to better anticipate what types of codes current production compilers 
could handle without producing performance problems. 



Table 5. Systems used in tuning/parallelizing the RISC-optimized shared memory 
version of F3D. 



SGI R4400 based Challenge and Indigo 2 
SGI R8000 and RIOOOO based Power Challenges 
SGI RIOOOO and R12000 based Origin 2000s 
SUN SuperSPARC based SPARCcenter 2000 
SUN Ultra SPARC II based HPC 10000 

Convex HP PA-7100 based SPP-1000 and HP PA-7200 based SPP-1600 
HP PA-8500 based V-Class 



Another key aspect of this effort was to validate the results [7]. There were several 
stages to this effort, ranging from quick and dirty tests involving only a few time 
steps, to more elaborate tests performed on fully converged solutions, to finally a 
complete manual review of the code as part of a formal validation and verification 
exercise. Unfortunately, when performing some of the intermediate level tests, a 
mistake that had been introduced into the code would sometimes be found. One of the 
most useful tools for locating these mistakes was to update the version number of the 
code daily so that one could go back and find which was the first version to have the 
bug. One could then use diff to identify what the differences were between that 
version of the code and the previous one. In most cases, this was sufficient in rapidly 
identifying and fixing the bug (in all of the affected versions). One extreme example 
of applying this technique occurred early on. In reordering the indices of several key 
arrays throughout the program, changing almost every executable line of code in the 
entire program became necessary. Furthermore, all of the lines had to be changed at 
the same time. Unfortunately, the odds of getting this right proved to be vanishingly 
small. Additionally, manual inspection of the code failed to find the problems. 
Fortunately, after going through the entire exercise a second time, we were able to 
diff the two modified versions of the code, locate the mistakes, and get a code that 
produced the correct answers in about half the time as the previous version of the 
code. 

The principle tools used for evaluating the performance of the parallelized code 
were various versions of profiling tools (e.g.. Speed Shop on the SGI sytems, 
CXPERF on the Convex and HP systems, and Loop Tool on the SUN HPC 10000). 
The tools, in combination with simply timing the runs for various combinations of 
problem sizes, numbers of processors, and systems, allowed us to identify the issues 
that affected parallel speedup. The single biggest issue was that since parallelization 
was being done incrementally, we needed to know which loops were expensive 
enough to justify being parallelized (both in terms of the effort and additional 
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overhead that would be introduced). Once this was well in hand, the main remaining 
issue was to understand how the variable costs associated with memory latency and 
bandwidth in a NUMA environment were affecting the code performance. 
Unfortunately, experience indicated that not all NUMA platforms were created equal, 
and these performance problems on the Convex Exemplars were never satisfactorily 
solved. Fortunately, the results on the SGI system and the SUN HPC 10000 were 
much better, in part due to all of the work done to try to make the code perform better 
on the Convex Exemplar [6]. 



7 The Effect of the NUMA Architecture 

With memory latencies ranging from 310 945 NS in a 128 processor SGI Origin 
2000 [8], without using any form of out-of-order execution and/or prefetching, one 
will see a range of usable preprocessor bandwidths of 412 MB/second down to 
135 MB/second. Clearly for a shared memory program with a poor cache hit rate and 
a high proportion of off node accesses, this will result in a significant performance 
problem. One potential solution is to make use of out-of-order execution and/or 
prefetching to overlap cache misses in an attempt to achieve a lower effective 
memory latency. Unfortunately, the maximum per processor usable bandwidth for off 
node accesses is estimated to be only 195 MB/second, which severely limits the 
effectiveness of this approach. Our solution to this problem was to produce a highly 
tuned code with a low enough cache miss rate that the NUMA nature of the Origin 
2000 did not matter. According to the output of Perfex when our code is run on a 
180 MHz RIOOOO based Origin 2000 using a single processor, we have only 
68 MB/second of memory traffic. Since this is far less than the 135 195 MB/second 
of usable bandwidth for off node accesses on the Origin 2000, we have been able to 
treat the Origin 2000 as though it had Uniform Memory Access. Unfortunately, this 
approach did not work nearly as well on the more heavily NUMA systems such as the 
Convex Exemplar. 

This is not to say that the memory systems on the SGI Origin 2000 and the SUN 
HPC 10000 did not cause problems. The traditional shared memory systems (e.g., the 
Cray C90 or the SGI Power Challenge) store data in small blocks or lines, with 
successive blocks or lines being interleaved between multiple memory banks. On 
some systems, the size of a block may be as small as a single word (e.g., 8 bytes), 
while on most cache based systems, a cache line would be treated as a single block 
(e.g., 64 or 128 bytes). On systems which group memory and processors into nodes 
(e.g., the SGI Origin 2000, the SUN HPC 10000, and the hypernodes on the Convex 
Exemplar), the unit of interleaving becomes a page of memory (e.g., 4 16 KB ). In 
that case, one can easily have data from the same page being shared by multiple 
processors. In extreme cases, this will result in a severe amount of contention with a 
resulting drop in performance. It is important to note that no amount of page 
migration will solve this problem n either will data placement directives. Data 
replication/caching can help. But the best solution is to initially avoid the problem. It 
is also interesting to note that as far as the authors know, there are no tools 
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that will identify this problem. The best way to identify the problem now is to profile 
fixed size runs with varying numbers of processors and look for subroutines that are 
consuming additional CPU cycles as the number of processors increases. Using tools 
such as Perfex or Speedshop can determine if the number of cache misses is 
remaining relatively constant. If this is the case, then one almost certainly has a 
problem with contention. Example 4 illustrates the three cases (ideal, acceptable, and 
unacceptable) of concern. Please note that even though Example 4c is using a 
STRIDE-N access pattern to batch up the buffer, it can still have an acceptable cache 
miss rate. Unfortunately, the process of batching up the buffer can result in an 
unacceptable amount of contention on some of these systems. To confuse matters 
further, and for reasons that the authors were never able to fully explain, the SGI 
Origin 2000 and the SUN HPC 10000 exhibited this problem under different 
conditions, thereby making it necessary to eliminate all instances of this type of code 
from the program. 



DIMENSION A(JMAX,KMAX,LMAX) 
C$doacross local (J,K,L) 

DO 10 L=1,LMAX 

DO 10 K=1,KMAX 

DO 10 J=1,JMAX 
A(J, K, L) = ... 

10 CONTINUE 



(a) An example of the best possible access ordering. 



DIMENSION A(JMAX,KMAX,LMAX) 

C$do across local (K,J,L) 

DO 10 K=1,KMAX 

DO 10 L=1,LMAX 

DO 10 J=1,JMAX 
A(J, K, L) = ... 

10 CONTINUE 

(b) An example of an acceptable, but less desirable ordering. 



DIMENSION A (UMAX, KMAX, UMAX) , BUFFER(KMAX) 

C$doacross local { J, L , K, BUFFER) 

DO 20 J=1,JMAX 

DO 20 L=1,LMAX 

DO 10 K=1,KMAX 

BUFFER = A(J,K,L) 

10 CONTINUE 

DO 20 K=1,KMAX 

Perform extensive calculations using BUFFER 

20 CONTINUE 



(c) An example of an unacceptable ordering. 



Example 4. The effect of memory access patterns on contention. 
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8 Related Work 

The simplest approach to using loop-level parallelism is to use an automatic 
parallelizing compiler. Unfortunately, as Michael Wolfe (the author of High 
Performance Compilers for Parallel Computing, Addison-Wesley, 1996) has pointed 
out parallelizing compilers don t work and they never will [9]. A slightly more 
sophisticated approach has been suggested by Dixie Hisley of the U.S. Army 
Research Laboratory. This approach uses a combination of compiler directives and 
hand tuning to parallelize those expensive loops which the automatic parallelizing 
compilers are unable to handle on their own. The remaining loops are left for the 
compiler to figure out. In some cases, this was shown to increase the scalability of the 
code from 8 to 16 processors, with little or no additional work over the use of 
compiler directives (e.g., C$doacross) and hand tuning on their own. In contrast, 
using only an automatic parallelizing compiler in this case actually produced parallel 
slowdown [10]. 

An excellent comparison using the combination of hand tuning, automatic 
parallelization, and compiler directives, to the HPF and CAPTools parallelization 
tools was presented at SC98 [5]. It found that all three approaches had merit, although 
none would produce good results when used on a fully automatic basis. In many 
cases, the results from using compiler directives (e.g., C$doacross or CAPTools 
specific directives) were as good as those produced by hand parallelizing the code 
using MPI. 

Marek Behr attempted to parallelize the F3D program on the Cray T3D using the 
CRAFT programming model. Unfortunately, this effort had to be abandoned due to 
poor levels of performance (something that was a common complaint with this 
programming model) [11]. He then proceeded to manually parallelize the code using 
message passing calls (SHMEM on the Cray T3D, Cray T3E, and SGI Origin 2000, 
and MPI on other platforms) to implement loop-level parallelism. While this approach 
worked and produced a credible level of performance, it was significantly more 
difficult to implement. Furthermore, because many of the target platforms (e.g., the 
Cray T3D, Cray T3E, and IBM SP with the Power 2 Super Chip) had caches ranging 
in size from 16 128 KB, it was impossible to perform many of the cache 
optimizations that we performed using caches with 1 8 M B of memory [12]. 

An approach similar to the one we used in tuning and parallelizing F3D was used 
by James Taft a NASA Ames Research Center to tune and parallelize the ARC3D 
code for the SGI Origin 2000 [13]. More recently, Mr. Taft has used multiple levels 
of shared memory parallelism (MLP) an approach that he has successfully 
demonstrated with the Overflow code and more recently, several other commonly 
used codes at NASA Ames Research Center [14,15]. Straight loop-level parallelism 
and MLP appear to be complementary techniques, each with their own strengths and 
weaknesses. 

There is also a significant body of research involving the use of software 
distributed shared memory [16]. Unfortunately, there is generally a significant 
performance penalty when using these systems, which keeps them from being widely 
accepted for production environments. One of the key problems is that modem SMP 
and MPP systems usually have per processor memory bandwidths ranging from 
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200 MB/second to over 1 GB/second, with memory latencies of 100 100 0 NS. In 
contrast, the communications bandwidth for most workstation clusters and many 
MPPs is rarely much better than 100 MB/second on a per processor basis, with a 
latency that is frequently in the range of 50 100 microseconds for the better systems 
(Note that the primary exception to this rule is the Cray T3E when using SHMEM). 
Attempting to maintain coherency with the 128 byte granularity used in the SGI 
Origin 2000 with a latency of 100 microseconds results in a per processor bandwidth 
for off node accesses of 1.3 MB/second. Eor programs that are parallelized in more 
than one direction and therefore inevitably have a high level of off node memory 
accesses, this low level of performance is virtually impossible to overcome on today s 
high performance systems. Even in hardware implementations of a COMA 
environment, keeping all of the processors for a single job on a single node is highly 
desirable [17]. 



9 Conclusion 

It was demonstrated that careful tuning at the implementation level can make a 
significant difference in the performance of code running on RISC-based systems. 
Furthermore, it was shown that there are two key enabling technologies for this effort: 

1) Large caches. 

2) Access to a large amount of main memory, something that SMPs excel at. 
Additionally, it was shown that the use of loop-level parallelism can be of 

significant benefit when trying to parallelize certain classes of vectorizable code. 
However, several limitations to this approach were discussed: 

The need for sufficiently large and powerful SMPs. 

The importance of tuning the code for a high level of serial efficiency. 

The limitations on scalability (overhead, Amdahl’s Law, stair stepping). 

In summary, the combination of loop-level parallelism and SMPs represents a 
powerful tool for the parallelization of vectorizable programs. As long as everyone 
understands the inherent limitations, it should have a most productive future. 



Glossary 

CFD Computational Fluid Dynamics 

Cache a small high speed memory that sits between the processor and main 
memory, which is designed to store values that are likely to be needed by 
the processor in the very near future. 

CISC Complicated Instruction Set Computer 

COMA Cache Only Memory Architecture 

GFLOPS Giga Floating-Point Operations Per Second 

HPF High Performance Fortran 

NUMA Non Uniform Memory Access 
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MPI Message Passing Interface 

MPP Massively Parallel Processor 

RISC Reduced Instruction Set Computer 

SIMD Single Instruction Multiple Data 

SMP Symmetric Multiprocessor 

TLB Translation Lookaside Buffer 
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Abstract. The objects that occur in scientift applications can he classifed into 
spatial structures, e.g. meshes, grids, or graphs, and (numerical) data that are 
associated with these structures, e.g. grid functions and (sparse) matrices. Our 
C++ template library Janus rests on the observation that the spatial structures 
are conceptually more stable than the associated data. Janus provides a concep- 
tual framework and generic components for mesh-based scientift applications. 
An outstanding feature of Janus is its unifed treatment of regular and in'egular 
structures. Our library has been developed using the paradigm of generic pro- 
gramming and is portably implemented on top of the Standard Template Library. 
It runs on top of MPI, but it can also be put onto other parallel platforms. 



1 Introduction 

Many scientift computing applications, like fiiite element simulations, are based on 
spatial structures with which various data are associated. These spatial structures may 
be sets of points, edges or triangles, for instance, and are referred to as domains in this 
work. The associated data are simulation values, like the temperature at each point or 
the pressure in each cell. 

The coherence of domains is specifed by relations. There are different kinds of rela- 
tions: for instance nearest neighbor relations and hierarchical relations as they occur in 
multilevel methods. Relations and (sparse) matrices that are associated to them describe 
the communication that occur in scientift applications. 

The spatial structures of scientift applications are often irregular. Developing an 
architecture-neutral, application-oriented framework of reusable software components 
for them, is an important and very challenging task. PETSc |Q|, for example, is such a 
framework that runs on top of MPI. However, since it is written in the C programming 
language, application data types cannot be directly plugged into it. Providing a well- 
designed C++ template library would allow this. More application-oriented approaches, 
such as POOMA Q or SAMRAI lOi, limit themselves to (nested) grids or structured 
adaptive mesh refhement. 

Other C++ template libraries whose design follows the paradigm of generic pro- 
gramming can combine expressiveness, reusability, and, last but not least, efftienc y, 
which is of paramount importance for many scientift applications. Convincing exam- 
ples are the Matrix Template Library (MTL) ^3 ^nd the Generic Graph Component 
Library (GGCL) 0. 

F. Mueller (Ed.): HIPS 2001, LNCS 2026, pp. 450 2001. 

(c) Springer- Verlag Berlin Heidelberg 2001 
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Our C++ template library Janus applies this very promising approach to the task 
of developing reusable components for the efficient representation of potentially dis- 
tributed data structures of mesh-hased scientift applications. Janus rests on the obser- 
vation that the spatial structures are prior to the associated data and also more stable 
than these. 

An important part of applying the paradigm of generic programming to a particular 
application domain is concept development. In section Q we give an overview of the 
three fundamental concepts of Janus — Domain, Relation, and Domain Function. The 
main template classes of Janus are presented in sectionQ There, we also explain how 
the concepts of Two Phase Domain and Two Phase Relation allow the efftient imple- 
mentation of data structures for irregular problems. In section Q we sketch how Janus 
components can be used for the parallel implementation of a two-dimensional adaptive 
fiiite element simulation. Our implementation also includes the repartitioning of the 
locally reined mesh to ensure that the computational load is balanced. 

2 Conceptual Framework 

Janus is a C++ template library that simplifes the fast and efftient implementation of 
Ihite element and other mesh or grid-based methods. Janus provides not only reusable 
components but also an extensible conceptual framework to deal efftiently with com- 
plex structures of scientift computations. 

The design of Janus is based on the paradigm of generic programming \ lUl that 
became well-known through the C++ Standard Template Library (STL) M- 

2.1 Main Concepts of Janus 

In contrast to the STL, we do not deal with fundamental data structures and algorithms 
of computer science. Rather, we devise concepts that are useful for the representation 
of sets, their relations, and the data defiled on them. In particular, our concepts shall 
ft the requirements of parallel adaptive mesh methods, that is, they must allow fast and 
flexible implementations. 

The three fundamental concepts of Janus are Domain, Relation, and Domain 
Function. 

The concepts Domain and Relation contain general requirements that are neces- 
sary to describe spatial structures. Basically, a domain is a fiiite set represented as a 
sequence. Every element of a domain has a unique position (its sequence index) by 
which it can be represented. This has several important consequences. 

It allows to represent data that are associated with a domain by a random access 
container of the same size as the underlying domain. This property is utilized in the 
concept Domain Function of Janus. 

Relations between domains can be represented by adjacency matrices. A key point 
of the Janus Relation concept is that it maintains both the application-oriented view 
of a relation and its efftient linear-algebraic representation. We require that Relation 
provides several methods (e.g. pull, push) for communicating data associated with 
the domains of a relation. The semantics of these methods is based on matrix vector 
multiplications. 
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2.2 One Phase and Two Phase Structures 

To deal with different dynamic requirements to domains, we distinguish between One 
Phase Domains and Two Phase Domains. This differentiation applies also to the con- 
cept Relation. 

Objects that represent complex, i.e., irregular spatial structures usually cannot be 
completely described at initialization time. They need an insertion phase (as a kind of 
extended initialization). In order to enable an efficient implementation, we require that 
these elements contained in such objects may only be retrieved after the insertion phase 
has been completed. Thus, we have a second phase (called retrieval phase) for these 
objects and we generally refer to them as two-phase structures. Two-phase structures 
have a method freeze whose call marks the transition from the insertion phase to the 
retrieval phase. 

We use the idea of one and two-phase structures to deffae refinements of our Do- 
main and Relation concepts. Their requirements and practically interesting models are 
outlined in the following section. 

Simple (regular) spatial structures, like rectangular grids, can be completely de- 
scribed at initialization time. There is no need of an insertion phase for them and their 
elements can be immediately retrieved. Since they have only a retrieval phase but no 
insertion phase, we refer to them as one-phase structures. 



3 The C++ Template Library Janus 

The template classes of Janus are models of the concepts that have been outlined in 
sectionQ The components of Janus can be used to efftiently implement a variety of 
applications ranging from simple fiiite difference methods on rectangular grids to adap- 
tive mesh refhement codes. FigureQgives an overview of the template classes currently 
offered by Janus. 




Fig. 1. Components of Janus. 



Since Janus concepts have only been informally introduced, we combine the intro- 
duction of the main template classes with a short speciftation of the most important 
concept requirements. More details can be found in the technical report lEI- 
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Most of the components of Janus can be used both within strictly sequential and 
parallel programming contexts. For the latter, it is assumed that Janus is part of an 
MPI 1^3 program. A user of Janus must specify at compile time in which mode the 
library shall be used. For this, there are the C-preprocessor ftgs .JANUS _SEQ_ and 
.JANUSjyiPI.. 

3.1 Domain Classes 

The concept Domain describes a set of n different objects that are numbered from 0 to 
n — 1. So in fact, a domain ZJ is a finite sequence 

(do, di, ■ . . , dji—i) (1) 

with the additional requirement that the mapping 1 1 — > di is one-to-one. 

A type X that is a model of the concept Domain must provide the nested types 
value.type and size.type and the methods size, operator [] (returns the 
element) and position (the inverse method to operator [] ). The operator [] 
must be a constant time operation. The method position shall not be more complex 
than 0(log(n)) where n is the number of elements of the domain. 

Janus currently offers two template classes that are models of the concept Domain. 
They are the grid<N> and domain<T> (see subsection^^ template classes. 

The template class domain can be used for irregular domains whose elements are 
explicitly enumerated. This template is a two-phase domain, which means that there is 
an explicit insertion phase. 

When an element is inserted into a domain, mapping information can be specifed 
(using method insert_at). The elements are distributed according to this mapping 
only when the freeze method is called. Therefore, only few large messages have to 
be sent instead of many small ones. This shows how two-phase structures enable the 
efffcient implementation of distributed data structures. 

The grid template is a one-phase domain and describes the Cartesian product of N 
integer intervals. Note that grid<N> is an example for a domain that is not a container, 
i.e., it does not store its elements. Its value type is svector<int , N>. The template 
svector<T , N> is a special utility class of Janus that represents (short) vectors whose 
size is known at compile time (see section^3l. 



3.2 Relation Classes 

The concept of a Relation R between two domains is based on the mathematical concept 
of a relation between the two sets: A relation is a subset of the cross product of two sets. 
Since we have a one-to-one correspondence between the domain elements and their 
positions, we can represent a relation by its adjacency matrix Mr. 

We give a very short description of currently three relation classes available: sten- 
cil for regular relations on grids and the more general types njtrelation and re- 
lation (see section^3. 
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Relations and Communication. The communication patterns of scientift applications 
are closely related to the adjacency matrix Mr. We only mention the sparsity pattern 
of the system matrix of a fhite element method or the interpolation and restriction 
operations within mesh hierarchies. In Janus, this knowledge is formalized and put into 
the requirements of the Relation concept. 

Beside the fact that Relation is a refiiement of the concept domain (its value type 
is ‘pair of integers’), we require that a model of the concept relation provides several 
matrix- vector operations. The most important of them are pull (multiplication with 
Mr) and push (multiplication with the transpose of Mr). 

First of all, we note that pull and push are member templates (a relatively new 
feature of C++) of the relation classes. The parameters of the two template member 
methods must provide random access to elements of domain functions. Typically, they 
will hold objects of type int, double, or coinplex<f loat>. 

Janus also requires the more general pull_matrix, push_matrix, gather 
and scatter template members by a model of the concept Relation. The difference 
between pull and pull .matrix is that pull multiplies with the adjacency matrix 
(which has only entries such as 0 or 1) whereas pulljnatrix uses a matrix with 
arbitrary coefftients on the pattern of the relation. The same holds for the transpose 
matrix operations push and pushjnatrix, respectively. The gather method lifts 
data that are associated with a domain onto a relation. In contrast, the scatter method 
projects data associated to a relation onto second domain factor of the relation. 

The template class n_relation<doml , dom2 , Gen> represents a relation be- 
tween two domains that can be described by the function object Gen. By regular, we 
mean a relation between two domains where each element of the ftst domain is in re- 
lation with a & ed number n of elements of the second domain. The adjacency matrix 
of such a relation is more compact than in the case of relation. A typical example 
is the triangle-edge relation of a triangulation: each triangle has exactly three edges. 

The template class relation<doml , dom2 > describes a potentially irregular re- 
lation between two domains. The relation template is a model of the concept Two- 
Phase Relation. This means that it provides an extended initialization phase that is &- 
ished only when the freeze method is called. Internally, the relation is represented 
using in the compressed row storage (CRS) format which allows an efftient sparse 
matrix vector multiplication, i.e. the pull and push methods. 

3.3 Domain Function Classes 

Spatial structures are represented by domains and relations between them. The core 
of scientift computations, however, is usually performed with data that are associated 
with these spatial structures. The reader shall think for example of pressure and veloc- 
ity values that are related to grid components and to (sparse) matrices associated with 
relations. 

The concept Domain Function describes objects of a type T that are associated to the 
elements of a domain d. If d consists of n elements then a domain function d provides 
random access to n objects of T. 

This defiiition exploits the fact that there is a one-to-one relation between the ele- 
ments of a domain and their positions. It also allows to use various types of data struc- 
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tures to represent a domain function, e.g., it is possible to use a C-array or the standard 
container type std : ; vector<T> to represent a domain function of values of type T. 

Janus provides the template class array<Dom, T> for domain functions related to 
proper domains and the (sparse) matrix template matrix<Rel , T>. 



4 A Sample Finite Element Problem 

In this section, we show in some detail how a two-dimensional fiiite element method 
can be implemented using components of the Janus framework. The complete descrip- 
tion can be found in 

For sake of simplicity, we consider the homogeneous Poisson problem. We de- 
ploy linear triangle fiiite elements which means that the domain of the boundary value 
problem is subject to a triangulation and that the corresponding fiiite element space is 
spanned by functions that are continuous and element-wise linear polynomials. 

We use a cascadic conjugate gradient method 1 ^ | to solve the discretized problem. 
This method works with a hierarchy of meshes. We obtain this hierarchy by successive 
adaptive refnement that rests on a simple gradient error indicator. The coarse algorith- 
mic structure of our solution method can be seen in fgure Q 



void cascadic_cg () 



for (k= 0; discretization_error () > eps_disc; k++) 



mesh [k] . 

solve directly () 


mesh [k] .x= prolongate (mesh[k-l] .x) 


mesh [k] .eg start () 


iteration error () > mesh[k].eps 


mesh [k]. eg iteration () 


mesh[k4-l]= refine (mesh[k]) 



Fig. 2. Nassy Shneiderman diagram of the cascadic CG method. 



As the reader can see in fgure Q there are several numerical operations performed 
between two mesh refnement steps. This is a good example for the relative stability of 
the spatial structures compared to the data defhed on them. 

In the following subsections we have a closer look at the various computational 
steps performed in our fiiite element method. 



4.1 Construction of the Finite Element Space 

The construction of the fiiite element space requires at fist a triangulation of the given 
domain. An example is given in fgure Q 
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Fig. 3. Test domain that was triangulated using Triangle IDl- 



Vertex and Triangle Sets. For the case at hand, the mesh is given by the set of triangles 
T and the set of vertices V. The node set N, i.e. the set that indexes the basis functions 
of the fiiite element space coincides for linear triangle elements with the set of vertices 
V. The two sets T and V are needed to defhe the functions that occur during the 
computations. 

One way to represent mesh components is to use natural numbers to name the ver- 
tices and to represent a triangle as the triple of its vertices. This representation is also 
used by the mesh generator Triangle o . To represent triangles, we use the svector 
template class of Janus (see section^3l. Since both the vertices and triangles have to 
be explicitly stored, we use the domain template (see section^^ to defhe types that 
represent the sets of vertices and triangles of our mesh. 



Important Mesh Relations. Various relations between the discrete sets T and V must 
be maintained since they are used to express the computations that involve functions 
on the different sets. For the problem at hand, we need the following two relations: the 
triangle-vertex relation Rtv which assigns to each triangle its three vertices and the 
nearest-neighbor relation Ryv- For the representation of the triangle-vertex relation, 
we can deploy the template class n_relation (see section^3l. This relation can 
be set up in the constructor by a user-specifed generator. Here the generator trian- 
gle_vertex_gen implements the association of the three vertices of a triangle with 
this triangle. 

The nearest-neighbor relation Rvv is fundamental for the description of the discrete 
operator An, i.e. this relation defnes the matrix sparsity of the linear system. This 
relation is defned as the set of all pairs of nodes for which an element (i.e. triangle) 
exists that contains both nodes 

2 

Rvv = u u {{vi,Vj) I (t,vo,vi,V2) e Rtv}- (2) 

teT i,j=0 



This relation is stored as Janus’ most general relation class relation and set up by 
inserting explicitly each pair of vertices that belongs to the same triangle. 
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4.2 Solution of the Linear System 

To solve the linear equations, we use the diagonally scaled CG method. As a side effect 
of the adaptive mesh refiiement, we get a hierarchy of linear equations, which can be 
utilized in multilevel solvers. For example, we can expand the CG method to a cascadic 
CG method by using the solution of the linear equation on a certain mesh as an initial 
guess on the next filer mesh (resp. the interpolated vector). This starting vector lies 
close to the solution, and the remaining error components are of high frequency (cf. [0 |)j 
i.e., they can be reduced with high convergence and the number of CG iterations on the 
individual levels is limited. 

In the following, we present fist performance results. TableOshows the mesh sizes 
over three levels of refiiement. 

Table 1. Progression of mesh sizes through three levels of refiiement. 



Mesh 


Vertices Triangles 


Edges 


Initial 


4,078 


7,846 


11,924 


Level 1 


7,730 


14,873 


22,603 


Level 2 


14,036 


27,062 


41,098 


Level 3 


33,052 


64,370 


97,422 


Level 4 


67,967 


133,232 


201,199 


Level 5 


156,891 


309,126 


466,017 


Level 6 


314,857 


623,033 


937,89C 


Level 7 


606,866 


1,203,944 


bo 

p 

bo 

_Q_ 



We ran our application from 1 to 64 nodes of the 64 node Linux PC cluster of 
the Real World Computing Partnership. Each node is equipped with a Pentium III 800 
MHz processor and 512 MB memory and a Myrinet card. We used MPICH on top of 
the user-level communication library PM lit I .SI that utilizes the Myrinet boards and uses 
a zero-copy protocol. 

TableQshows the overall execution time in seconds for the solution of the Poisson 
problem. Note that our times include not only the times for the three refiiement, par- 
titioning, and remapping steps but also the computation of the element matrices, their 
assemblage and the execution of conjugate gradient methods on all levels. On a single 
processor we ran a completely sequential version of our program which of course needs 
no mapping algorithms at all and which contains no overhead from the parallelization. 

Two different mesh partitioners were used to compute the new partition 
of the adaptively refiied mesh, namely the graph partitioner ParMetis [JU (the 
ParMETIS-RepartMLRemap function) and our in-house development BHT (Bal- 
anced Hypersphere Tessellation) Q. Figure0shows the different partitions of a mesh 
with BHT that correspond to two refiiement steps. In our actual computations, we 
started with a fher subdivided mesh. 

For the valuation of the results, one has to keep in mind that the application is highly 
irregular and dynamic. This table also indicates that for adaptively refiied meshes, our 
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Fig. 4. Different steps of refnement (both domains are partitioned into 8 subdomains). 
Table 2. Overall execution time of our adaptive fiiite element method. 



Nodes 


1 2 4 8 16 32 64| 


BHT 

ParMetis 


121.7 78.5 39.6 20.4 10.8 6.8 7.5| 
121.7 83.2 41.8 21.6 11.8 7.4 6.3| 



coordinate-based partitioning algorithm (BHT) can compete with state-of-the-art graph 
partitioners. A further advantage of BHT is its low memory consumption. 



5 Conclusions and Future Work 

Janus provides a small yet very expressible set of concepts and template classes for the 
implementation of mesh based scientift applications. An important feature of Janus is 
its unifed treatment of regular and irregular spatial structures. Two phase domains and 
relations make highly ft xible yet slow classical data structures (trees, lists) dispensable 
for the implementation of irregular scientift problems. This makes it very attractive to 
use our library for the implementation of adaptive mesh methods. 

The main components of Janus, i.e., its domain, relation, and domain function 
classes are very much application-oriented abstractions. As a consequence, the whole 
program development is raised to a higher level. Communication, for example, is ex- 
pressed by pull and push operations of relation objects. Janus provides efftient imple- 
mentations of these operations both for parallel and sequential platforms. Thus, these 
low-level aspects of scientift programming cease to be a problem for the application 
programmer. 

Janus is not only used for the implementation of two-dimensional mesh methods. 
An important feature of Janus is that its abstractions are essentially dimension free. The 
Fracture Group at the Theory Center of Cornell University utilizes Janus to represent 
the mesh components of a three-dimensional fhite element crack simulation. It has been 
shown in this project that Janus can interoperate with other libraries, since the linear 
systems are solved with PETSc after being constructed with Janus. To support this 
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interoperability, we plan to develop a Janus interface to PETSc. We are also working 
on an implementation of Janus for shared memory systems on top of OpenMP. More 
information about Janus (including the source code) can be obtained from the web site 
httD : / / WWW. first . amd . cte/ 1 anus 
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Abstract. DSM-PM2 is a platform for designing, implementing and experi- 
menting with multithreaded DSM consistency protocols. It provides a generic 
toolbox which facilitates protocol design and allows for easy experimentation 
with alternative protocols for a given consistency model. DSM-PM2 is portable 
across a wide range of clusters. We illustrate its power with figures obtained for 
different protocols implementing sequential consistency, release consistency and 
Java consistency, on top of Myrinet, Fast-Ethernet and SCI clusters. 



1 Introduction 

In their traditional flavor. Distributed Shared Memory (DSM) libraries 
allow a number of separate processes to share a common address space using a con- 
sistency protocol according to a semantics specified by some given consistency model: 
sequential consistency, release consistency, etc. The processes may usually be phys- 
ically distributed among a number of computing nodes interconnected through some 
communication library. The design of the DSM library is often highly dependent on 
the selected consistency model and on the communication library. Also, only a few of 
them are able to exploit the power of modern thread libraries to provide multithreaded 
protocols, or at least to provide thread-safe versions of the consistency protocols. 

Most approaches to DSM programming assume that the DSM library and the un- 
derlying architecture are flxed, and that it is up to the programmer to fit his program 
with them. We think that such a static vision fails to appreciate the possibilities of 
this area of programming. We believe that a better approach is to provide the applica- 
tion programmer with an implementation platform where both the application and the 
multithreaded DSM consistency protocol can possibly be co-designed and tuned for 
performance. This aspect is crucial if the platform is used as target for a compiler: the 
implementation of the consistency model through a specific protocol can then directly 
benefit from the specific properties of the code, enforced by the compiler in the code 
generation process. The platform should moreover be portable, so that the programmers 
do not have to commit to some existing communication library or operating system, or 
at least be able to postpone this decision as late as possible. 

DSM-PM2 is a prototype implementation platform for multithreaded DSM pro- 
gramming which attempts to meet these requirements. Its general structure and pro- 
gramming interface are presented in SectionQ SectionQdiscusses in more detail how to 
select and define protocols. A given consistency model can be implemented via multiple 
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alternative protocols. We give an overview of the implementation of several protocols 
for various consistency models, including sequential consistency, release consistency 
and Java consistency (which is a variant of release consistency). In particular, two al- 
ternative protocols addressing the sequential consistency model are described, a first 
one based on page migration, and the second one using thread migration, as enabled 
by the underlying multithreading library. Finally, we Illustrate the portability and effi- 
ciency of DSM-PM2 by reporting performance measurements on top of different clus- 
ter architectures using various communication interfaces and interconnection networks: 
BIP ^3/Myrinet, TCP/Myrinet, TCP/FastEthernet, SISCFSCI 

Related Work 

The concept of Distributed Shared Memory was proposed more than a decade ago 03. 
Important efforts have been subsequently made to improve the performance of software 
DSM systems and many such systems were proposed to illustrate new ideas. Progresses 
related to the relaxation of consistency protocols were illustrated with Munin |[F] (for 
release consistency), TreadMarks Q (to study the impact of laziness in coherence prop- 
agation, through lazy release consistency), Midway [)J (for entry consistency), and Bra- 
zos 1^3 (for scope consistency). Recent software DSM systems, such as Millipede o, 
CVM ^3 2 nd Brazos integrate the use of multithreading. 

Our work is more closely related to that of DSM-Threads |Q|, a system which ex- 
tends POSIX multithreading to distributed environments by providing a multithreaded 
DSM. Our approach is different essentially by the generic support and the ability to 
support new, user-defined consistency protocols. Millipede m also integrates threads 
with Distributed Shared Memory. It has been designed for a specific execution environ- 
ment (Windows NT cluster with Myrinet) and focuses on sequential consistency only. 
CVM ^3 is another software DSM system which provides multithreading (essentially 
to hide the network latency) and supports multiple consistency models and protocols. 
However, CVM’s communication layer targets the UDP protocol only, whereas DSM- 
PM2 captures the benefits of PM2’s portability on a large variety of communication 
interfaces: it is currently available on modern Myrinet and SCI high-performance clus- 
ters run with Linux. The primary goal of DSM-PM2 is to provide a portable platform 
for easy protocol experimentation. Its customizability makes it also valuable as a target 
for compilers as the Java Hyperion compiler discussed in Section^3 



2 DSM-PM2: An Overview 

2.1 The PM2 Runtime System 

PM2 (Parallel Multithreaded Machine) |1Q| is a multithreaded environment for dis- 
tributed architectures. It provides a POSIX-like interface to create, manipulate and syn- 
chronize lightweight threads in user space, in a distributed environment. Its basic mech- 
anism for inter-node interaction is the Remote Procedure Call (RPC). Using RPCs, the 
PM2 threads can invoke the remote execution of user-defined services. Such invoca- 
tions can either be handled by a pre-existing thread, or they can involve the creation 
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of a new thread. While threads running on the same node can freely share data, PM2 
threads running on distant nodes may only interact through RPC. This mechanism can 
he used either to send/retrieve information to/from the remote node, or to have some 
remote action executed. The minimal latency of a RPC is 6 /rs over SISCI/SCI and 8 /iS 
over BIP/Myrinet on our local Linux clusters. 

PM2 includes two main components. For multithreading, it uses Marcel, an effi- 
cient, user-level, POSIX-like thread package. To ensure network portability, PM2 uses 
an efficient communication library called Madeleine |Q, which was ported across a 
wide range of communication interfaces, including high-performance ones such as 
BIP SISCI, VIA 101, as well as more traditional ones such as TCP, and MPI. 

An interesting feature of PM2 is its thread migration mechanism that allows threads 
to be transparently and preemptively moved from one node to another during their ex- 
ecution. Such a functionality is typically useful to implement generic policies for dy- 
namic load balancing, independently of the applications: the load of each processing 
node can be evaluated according to some measure, and balanced using preemptive mi- 
gration. The key feature enabling preemptiveness is the iso-address approach to dy- 
namic allocation featured by PM2. The isomalloc allocation routine guarantees that 
the range of virtual addresses allocated by a thread on a node will be left free on any 
other node. Thus, threads can be safely migrated across nodes: their stacks and their 
dynamically allocated data are just copied on the destination node at the same virtual 
address as on the original node. This guarantees the validity of all pointers without any 
further restriction [Q|. Migrating a thread with a minimal stack and no attached data, 
takes 62 /rs over SISCI/SCI and 75 /xs over BIP/Myrinet on our local Linux clusters. 

2.2 DSM-PM2: Towards a Portable Implementation Platform 

DSM-PM2 provides the illusion of a common address space shared by all PM2 threads 
irrespective of their location and thus implements the concept of Distributed Shared 
Memory on top of the distributed architecture of PM2. But DSM-PM2 is not simply a 
DSM layer for PM2: its goal is to provide a portable implementation platform for mul- 
tithreaded DSM consistency protocols. Given that all DSM communication primitives 
have been implemented using PM2’s RPC mechanism based on Madeleine, DSM-PM2 
inherits PM2’s wide network portability. However, the most important feature of DSM- 
PM2 is its customizability: actually, the main design goal was to provide support for 
implementing, tuning and comparing several consistency models, and alternative pro- 
tocols for a given consistency model. 

As a starting remark, we can notice that all DSM systems share a number of com- 
mon features. Every DSM system, aimed for instance at illustrating a new version of 
some protocol, has to implement again a number of core functionalities. It is therefore 
interesting to ask: What are the features that need to be present in any DSM system? 
And then: What are the features that are specific to a particular DSM system? By an- 
swering these questions, we become able to build a system where the core mechanisms 
shared by the existing DSM systems are provided as a generic, common layer, on top 
of which specific protocols can be easily built. In our study, we limit ourselves to page- 
based DSM systems. 
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Access Detection. Most DSM systems use page faults to detect accesses to shared 
data, in order to carry out actions necessary to guarantee consistency. The generic 
core should provide routines to detect page faults, to extract information related to 
each fault (address, fault type, etc.) and to associate protocol- specific consistency 
actions to a page-fault event. 

Page Manager. Page-hased DSM systems use a page table which stores information 
about the shared pages. Each memory page is handled individually. Some informa- 
tion fields are common to virtually all protocols: local access rights, current owner, 
etc. Other fields may be specific to some protocol. The generic core should pro- 
vide the page table structure and a basic set of functions to manipulate page entries. 
Also, the page table structure should be designed so that new information fields 
could be added, as needed by the protocols of interest. 

DSM Communication. We can notice that the known DSM protocols use a limited 
set of communication routines, like sending a page request, sending a page, send- 
ing diffs (for some protocols implementing weak consistency models, like release 
consistency). Such a set of routines should also be part of the generic core. 

Synchronization and Consistency. Weaker consistency models, like release, en- 
try, or scope consistency require that consistency actions be taken at synchroniza- 
tion points. In order to support these models, the generic core should provide syn- 
chronization objects (locks, barriers, etc.) and enable consistency actions to be as- 
sociated to synchronization events. 

Thread-Safety. Modern environments for parallel programming use multithreading. 
All the data structures and management routines provided by the generic core 
should be thread-safe: multiple concurrent threads should be able to safely call 
these routines. 



A closer study of page-based consistency protocols enables to list up a small num- 
ber of events which should trigger consistency actions: page faults, receipt of a page 
request, receipt of the requested page, receipt of an invalidation request. Additionally, 
for weak consistency models, lock acquire, lock release and barrier calls are events to 
be associated with consistency actions. In the current version of DSM-PM2, there are 
8 actions. The detailed list is given in TableD 

Table 1 . DSM-PM2 protocol actions. 



Protocol function 


Description 


read fault handler 


Called on a read page fault 


write fault handler 


Called on a write page fault 


read server 


Called on receiving a request for read access 


write server 


Called on receiving a request for write access 


invalidate server 


Called on receiving a request for invalidation 


receive page server 


Called on receiving a page 


lock_acquire 


Called after having acquired a lock 


lock_release 


Called before releasing a lock 
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Once the generic core has been delineated, we can consider building consistency 
protocols on top of it. Designing a protocol in DSM-PM2 consists in providing a set of 
8 routines, one for each action identifed above. These routines are designed using on 
the API of the generic components. They are automatically called by DSM-PM2, and 
nothing more has to be done by the programmer. According to our personal experience, 
the code for the routines is quite manageable: a few hundreds of lines for the whole set 
of routines of a typical protocol. A key feature of DSM-PM2 is that all the mechanisms 
provided by the generic core are thread-safe. The task of the protocol designer is thus 
considerably alleviated as most (if not all!) subtle synchronization problems are already 
addressed by the core routines. 



DSM-PM2 




DSM protocol policy 








DSM protocol lib 








DSM page manager DSM comm 





Fig. 1. Overview of the DSM-PM2 software architecture. 
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As a consequence of our distinction between generic core mechanisms and protocol- 
specift actions, DSM-PM2 is structured in layers (Figure^- At the lowest level, DSM- 
PM2 includes two main components which make up the the main part of the generic 
core: the DSM page manager and the DSM communication module. Both are based on 
the API of PM2: no direct access to the thread low-level structures and to the underlying 
communication library are made. 

The DSM page manager is essentially dedicated to the low-level management of 
memory pages. It implements a distributed table containing page ownership information 
and maintains the appropriate access rights on each node. This table has been designed 
to be generic enough so that it could be exploited to implement protocols which need a 
fk ed page manager, as well as protocols based on a dynamic page manager (see for 
a classiftation of page managers). Of course, each protocol uses the felds in the page 
entries of the table as required by its corresponding page management strategy (which 
is decided at the higher, protocol library level). Consequently, a feld may have different 
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semantics in different protocols and may be even left unused by some protocols. Also, 
new felds could be easily added if needed in the future. 

The DSM communication module is responsible for providing elementary commu- 
nication mechanisms, such as delivering requests for page copies, sending pages, inval- 
idating pages or sending diffs. This module is implemented using PM2’s RPC mech- 
anism, which turns out to be well-suited for this kind of task. For instance, requesting 
a copy of a remote page for read access can essentially be seen as invoking a remote 
service. On the other hand, since the RPCs are implemented on top of the Madeleine 
communication library, the DSM-PM2 communication module is portable across all 
communication interfaces supported by Madeleine at no extra cost. 

The routines which compose a protocol are defhed using a toolbox called the DSM 
protocol library layer. It provides routines to perform elementary actions such as bring- 
ing a copy of a remote page to a thread, migrating a thread to some remote data, inval- 
idating all copies of a page, etc. All the available routines are thread-safe. This library 
is built on top of the two base components of the generic core; the DSM page manager 
and the DSM communication module. 

Finally, at the highest level, a DSM protocol policy layer is responsible for building 
consistency protocols out of a subset of the available library routines. An arbitrary num- 
ber of protocols can be defiled at this level, which may be selected by the application 
through a specifi; library call. Some classical protocols are already built-in, as summa- 
rized in Table O but the user can also add new protocols, as described in Section^3 
by defiling each of the component routines of each protocol and by registering it using 
specifi; library calls. 

2.3 Using Protocols 

Table 2. Consistency protocols currently available in the DSM-PM2 library. 



Protocol 


Consistency 


Basic features 


li hudak 


Sequential 


MRSW protocol. Page replication on read access, page 
migration on write access. Dynamic distributed manager. 


migrate thread 


Sequential 


Uses thread migration on both read and write faults. Fixec 
distributed manager. 


ere sw 


Release 


MRSW protocol implementing eager release consistency. 
Dynamic distributed manager. 


hbre mw 


Release 


MRMW protocol implementing home-based lazy release 
consistency. Fixed distributed manager. Uses twins anc 
on-release diffhg. 


j ava_ic 


Java 


Home-based MRMW protocol, based on explicit inline 
checks (ic)for locality. Fixed distributed manager. Uses 
on-the-l|f diff recording. 


java pf 


Java 


Home-based MRMW protocol, based on on page 
faultsipf ). Fixed distributed manager. Uses on-the-l|^ difl 
recording. 
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#include "pm2.h" 

BEGIN_DSM_DATA 
int X = 34; 

/* ■ ■ ■ */ 

END_DSM_DATA 

void main (void) 

{ 

/* Use the built-in 'li_hudak' protocol */ 

pm2_dsm_set_default_protocol (li_hudak) ; 
pm2_init ( ) ; 

X++ ; 

/* ■ • ■ */ 

} 



Fig. 2. Using DSM-PM2 with a built-in protocol. 



In DSM-PM2, a specific protocol is a set of actions designed to guarantee con- 
sistency according to a consistency model. In our current implementation, a protocol 
is specified through 8 routines (listed in Table that are automatically called by the 
generic DSM support as needed. Each protocol is labeled by a unique identifier. This 
identifier can for instance be used to set it up as the default protocol or to associate it to 
dynamically allocated shared objects. DSM-PM2 protocols can be specified and used 
in three different ways. 

Using Built-in Protocols. The easiest way consists in selecting one of the available 
built-in protocols. In its current stage of development, DSM-PM2 provides 6 such 
protocols, whose main characteristics are summarized in Table Qand detailed in 
SectionQ On FigureQ the 1 i_hudak protocol is declared as the default protocol 
for the static shared area. 

Building New Protocols. The user can also a define a new protocol by providing 
each of its component routines and by registering it using a specific library call. 
The newly created protocol can then be used exactly in the same way as built-in 
protocols. 

int new_proto; 

new_prot = dsm_create_protocol 
( read_f ault_handler , write_f ault_handler , 
read_server, write_server , 
invalidate_server , receive_page_server , 
acquire_handler , release_handler ) ; 

pm2_dsm_set_def ault_protocol (proto) ; 
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Building Protocols Using Library Routines. A mixed approach consists in us- 
ing existing library routines, as provided in the DSM protocol library layer, rather 
than new, user-defined routines, but combine then in some ad-hoc way. One may 
thus consider hybrid approaches such as page replication on read fault (like in the 
li_hudak protocol) and thread migration on write fault (like in the 
migrate_thread protocol). One may even embed a dynamic mechanism se- 
lection within the protocol, switching for instance from page migration to thread 
migration depending on ad-hoc criteria. However, the user is responsible for using 
these features in a consistent way to produce a valid protocol. 

Observe that no pre-processing of the code file is used. Consequently, it is possible 
to define a number of protocols in a program and to dynamically select one of them 
according to the arguments provided by the user without any recompilation: 

int protol, proto2 ; 

protol = dsm_create_protocol (...); 

proto2 = dsm_create_protocol (...); 

if (...) pm2_dsm_set_default_protocol (protol) ; 
else pm2_dsm_set_default_protocol (proto2 ) ; 

Again, the built-in protocols are just pre-defined protocols, so they can freely be in- 
cluded in such a selection. 

On FigureO a protocol is associated to a static memory area. DSM-PM2 also pro- 
vides dynamic allocation for shared memory. Each such dynamically-allocated shared 
area can be managed with a specific protocol, which can be specified through its cre- 
ation attribute as illustrated. (Otherwise, the default protocol set by 
pm2_dsm_set_def ault_protocol is used.) Consequently, different DSM pro- 
tocols may be associated to different DSM memory areas within the same application. 

#define N 128 
int *ptr; 
dsm_attr_t attr; 

dsm_attr_set_protocol (&attr, li_hudak) ; 
ptr = (int*) dsm_malloc (N*sizeof (int) , &attr) ; 

In the current version of the system, DSM-PM2 does not provide any specific sup- 
port to dynamically switch the management of a memory area from one protocol to 
another one within the same run. However, this can be achieved if needed through a 
careful synchronization at the program level (e.g. through barriers). Essentially, one has 
to keep the corresponding memory area from being accessed by the application threads 
during the protocol switch, since this operation involves modifications in the distributed 
page table on all nodes. 

DSM-PM2 provides a multithreaded DSM interface: static and dynamic data can 
be shared by all the threads in the system. Since the programming interface is intended 
both for direct use and as a target for compilers, no pre-processing is assumed in the 
general case and accesses to shared data are detected using page faults. Nevertheless, 
when DSM-PM2 is used as a compiler target, accesses to shared data may be carried 
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out through specific runtime primitives like get and put (and not through direct as- 
signment). The implementation of these primitives may then explicitly check for data 
locality and handle consistency accordingly. DSM-PM2 thus provides a way to bypass 
the page fault detection and to directly activate the protocol actions. 

3 Built-in Protocols Available with DSM-PM2 

Currently, DSM-PM2 provides 6 built-in protocols, whose main characteristics are sum- 
marized in TableO All these protocols share two important common features. 1) Their 
implementations are multithreaded: it uses multiple “hidden” threads to maintain the 
internal data structures and to enhance reactivity to external events such as message 
arrival. 2) They are thread-safe: an arbitrary number of user-level threads can concur- 
rently access pages on any node and threads on the same node safely share the same 
page copy. These distinctive features required that the traditional consistency proto- 
cols (usually written for single-threaded systems) which we used as a starting point be 
adapted to a multi-threaded context to handle thread-level concurrency. As opposed to 
the traditional protocols where all page faults on a node are processed sequentially, con- 
current requests may be processed in parallel in a multithreaded context, should they 
concern the same page or different pages. 

3.1 Sequential Consistency 

We provide two protocols for sequential consistency. The 1 i_hudak protocol relies 
on a variant of the dynamic distributed manager MRSW (multiple reader, single writer) 
algorithm described by Li and Hudak |Q, adapted by Mueller m . It uses page repli- 
cation on read fault and page migration on write fault. Note that in a multithreaded 
context, the single writer refers to a node, not to a thread, since all the threads on the 
‘writer’ node share the same copy. They may thus write it concurrently. 

Alternatively, DSM-PM2 provides a new protocol for sequential consistency based 
on thread migration (migrate_thread), illustrated in FigureQ When a thread ac- 
cesses a page and does not have the appropriate access rights, it executes the page fault 
handler which simply migrates the thread to the node owning the page (as specified by 
the local page table). On reaching the destination node, the thread exits the handler and 
repeats the access, which is now successfully carried out and the thread continues its 
execution. Note the simplicity of this protocol, which essentially relies on a single func- 
tion; the thread migration primitive provided by PM2. The counterpart is that the pages 
are not replicated in this protocol (i.e., for each page, there is a unique node where the 
page can be accessed both for read and write), so that all threads accessing a non local 
page will migrate to the corresponding owning node. Though the migration cost is gen- 
erally very low, the efficiency of this protocol is highly influenced by the distribution 
of the shared data, which has a direct impact on the load balancing (since the threads 
migrate to the data they access). This point is discussed in Section0 

The protocol described above crucially depends on an iso-address approach to data 
allocation Q: not only static, but also dynamically allocated DSM pages are mapped 
at the same virtual address on all nodes, using the isomalloc allocation routine of 
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Fig. 3. Sequential consistency using thread migration: on page fault, the thread migrates 
to the node where the data is located. 



PM2. On exiting the fault handler after migration, the thread automatically repeats the 
access at the same address, which does correspond to the same piece of data. 

3.2 Release Consistency 

DSM-PM2 also provides two alternative implementations for release consistency. The 
erc_sw protocol is a MRSW protocol for eager release consistency. It uses page repli- 
cation on read fault and page migration on write fault, based on the same dynamic 
distributed manager scheme as li_hudak. Page ownership migrates along with the 
write access rights. Pages in the copyset get invalidated on lock release. 

Alternatively, the hbrc_mw protocol is a home-based protocol allowing multiple 
writers (MRMW protocol) by using the ‘classical’ twinning technique described in |[l4|. 
Essentially, each page has a home node, where all threads have write access. On page 
fault, a copy of the page is brought from the home node and a twin copy gets created. 
On release, page diffs are computed and sent to the home node, which subsequently 
invalidates third-party writer nodes. On receiving such an invalidation, these latter nodes 
need to compute and send their own diffs (if any) to the home node. 

3.3 Java Consistency 

DSM-PM2 provides two protocols which directly implement consistency as specified 
by the Java Memory Model ^3 (we refer to this consistency using the term “Java con- 
sistency”). Thanks to these protocols, DSM-PM2 is currently used by the Hyperion Java 
compiling system |Ol and consequently supports the execution of compiled threaded 
Java programs on clusters. Our DSM-PM2 protocols were co-designed with Hyper- 
ion’s memory module and this approach enabled us to make aggressive optimizations 
using information from the upper layers. For instance, a number of synchronizations 
could thereby be optimized out. 
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The Java Memory Model allows threads to keep locally cached copies of objects. 
Consistency is provided by requiring that a thread’s object cache be flushed upon en- 
try to a monitor and that local modifications made to cached objects be transmitted to 
the central memory when a thread exits a monitor. Gontmakher and Schuster have 
shown that the JMM provides Release Consistency for synchronized access to non- 
volatile variables and stricter forms of consistency for the other cases. That is, Java 
Consistency is equivalent to Release Consistency in most cases. 

The concept of main memory is implemented with DSM-PM2 via a home-based 
approach. The home node is in charge of managing the reference copy. Objects (initially 
stored on their home nodes) are replicated if accessed on other nodes. Note that at most 
one copy of an object may exist on a node and this copy is shared by all the threads 
running on that node. Thus, we avoid wasting memory by associating caches to nodes 
rather than to threads. 

Since Hyperion uses specific access primitives to shared data (get and put), we 
can use explicit checks to detect if an object is present (i.e., has a copy) on the local 
node, thus by-passing the page-fault mechanism. If the object is present, it is directly 
accessed, else the page containing the object is brought to the local cache. This scheme 
is used by the j ava_ic protocol (where ic stands for inline check). Alternatively the 
java_pf protocol uses page faults to detect accesses to non-local objects (hence the 
pf suffix). Through the put access primitives, the modifications can be recorded at the 
moment when they are carried out, with object-field granularity. All local modifications 
are sent to the home node of the page by the main memory update primitive, called by 
the Hyperion run-time on exiting a monitor. 



4 Performance Evaluation 



We present the raw performance of our basic protocol primitives on four different plat- 
forms. The measurements were first carried out on a cluster of 450 MHz PII nodes 
running Linux 2.2.13 interconnected by a Myrinet network using the BIP and TCP pro- 
tocols and by a Fast Ethernet network under TCP. Then, the same measurements were 
realized on a cluster of PII 450 MHz nodes interconnected by a SCI network. 

TableQreports the time (in ps) taken by each step involved when a read fault occurs 
on a node, assuming that the corresponding protocol is page-transfer based (which is 
the case for all built-in protocols, except for migrate_thread). First, the faulting 
instruction leads to a signal (page fault), which is caught by a handler that inspects the 
page table to locate the page owner and then requests the page to this owner (request 
page). The request is processed on the owner node and the required page is sent to the 
requester (page transfer). The time reported here corresponds to a common 4 kB page. 
Finally, the protocol overhead includes the request processing time on the owner node 
and the page installation on the requesting node. 

As one can observe, the protocol overhead of DSM-PM2 is only up to 15% of the 
total access time, as most of the time is spent with communication. The protocol over- 
head essentially consists in updating page table information and setting the appropriate 
access rights. 
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Table 3. Processing a read-fault under page-migration policy: Performance analysis. 



Operation 


BIP/Myrinet 


TCP/Myrinet 


TCP/Fast Ethemel 


SISCI/SCl 


Page fault 


11 


11 


11 


11 


Request page 


23 


220 


220 


38 


Page transfer 


138 


343 


736 


119 


Protocol overhead 


26 


26 


26 


26 


Total (ps) 


198 


600 


993 


194 



In Tableflwe report the cost (in /xs) for processing a read fault assuming a thread- 
migration based implementation of the consistency protocol. The protocol overhead is 
here insignificant (less than 1 /xs), since it merely consists of a call to the underlying run- 
time to migrate the thread to the owner node. In PM2, migrating a thread means moving 
the thread stack and the thread descriptor to the destination node, possibly together with 
some private dynamically allocated data (which is not the case in this example). 



Table 4. Processing a read-fault under thread-migration policy: Performance analysis. 



Operation 


BIP/Myrinet 


TCP/Myrinet 


TCP/Fast Ethemel 


SISCI/SCl 


Page fault 


11 


11 


11 


11 


Thread migration 


75 


280 


373 


62 


Protocol overhead 


1 


1 


1 


1 


Total (ps) 


87 


292 


385 


74 



We can observe that this migration-based implementation outperforms the previous 
one, because thread migration is very efficient. Note however, that this migration time is 
closely related to the stack size of the thread. In our test program, the thread’s stack was 
very small (about 1 kB), which is typically the case in many applications, but not in all 
applications. Thus, choosing between the implementation based on page transfer and 
the one based on thread migration deserves careful attention. Moreover, it may depend 
on other criteria such as the number and the location of the threads accessing the same 
page, and may be closely related to the load balance, as illustrated below. This is a 
research topic we plan to investigate in the future. 

To illustrate DSM-PM2’s ability to serve as an experimental platform for comparing 
consistency protocols, we have run a program solving the Traveling Salesman Problem 
for 14 randomly placed cities, using one application thread per node. Figure 0 presents 
run times for our 4 protocols implementing sequential and release consistency, on the 
BIP/Myrinet platform. Given that the only shared variable intensively accessed in this 
program is the current shortest path and that the accesses to this variable are always 
lock protected, the benefits of release consistency over sequential consistency are not 
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Fig. 4. Solving TSP for 14 cities with random inter-city distances: Comparison of 4 
DSM protocols. 



illustrated here. But we can still remark that all protocols based on page migration per- 
form better than the protocol using thread migration. This is essentially due to the fact 
that all computing threads migrate to the node holding the shared variable, which thus 
gets overloaded. We could expect a better behavior for this protocol with applications 
where shared data are evenly distributed across nodes and uniformly accessed. 

To compare our two protocols for Java consistency, we have run a multithreaded 
Java program implementing a branch-and-bound solution to the minimal-cost map- 
coloring problem, compiled with Hyperion |Q. The program was run on a four-node 
cluster of 450 MHz Pentium II processors running Linux 2.2.13, interconnected by a 
SCI network using the SISCI API and solves the problem of coloring the twenty-nine 
eastern-most states in the USA using four colors with different costs. Figure Qclearly 
shows that the protocol using access detection based on page faults (j ava_pf) outper- 
forms the protocol based on in-line checks for locality ( j ava_ic). This is due to the 
intensive use of objects in the program: remember that every get and put operation 
involves a check for locality in java_ic, whereas this is not the case for accesses 
to local objects when using j ava_pf . The overhead of fault handling appears to be 
signihcantly less important than the overhead due to checks, also thanks to a good dis- 
tribution of the objects: local objects are intensively used, remote accesses (generating 
faults for j ava_pf ) are not very frequent. 

Of course, we are aware that the performance evaluation reported above can only be 
considered as preliminary. A more complete analysis is necessary to study the behavior 
of the DSM-PM2 protocols with respect to different classes of applications illustrating 
various sharing patterns, access patterns, synchronization methods, etc. This is part of 
our current work. 

Finally, we can mention that very precise post-mortem monitoring tools are avail- 
able in the PM2 platform, providing the user with valuable information on the time 
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12 4 



nodes 

Fig. 5. Comparing the two protocols for Java consistency: Page faults vs. In-line checks. 



spent within each elementary function. This feature proves very helpful for understand- 
ing and improving protocol performance. 



5 Conclusion 

DSM-PM2 is a platform for designing, implementing and experimenting with multi- 
threaded DSM consistency protocols. It provides a generic toolbox which facilitates 
protocol design and allows for experimentation with alternative protocols for a given 
consistency model. DSM-PM2 is portable across a wide range of cluster architectures, 
using high-performance interconnection networks such as BIP/Myrinet, SISCI/SCI, 
VIA, as well as more traditional ones such as TCP, and MPI. In this paper, we have 
illustrated its power by presenting different protocols Implementing sequential consis- 
tency, release consistency and Java consistency, on top of different cluster architectures: 
BIP/Myrinet, TCP/Myrinet, TCP/FastEthernet, SISCI/SCI. 

DSM-PM2 is not just yet another multithreaded DSM library. It is aimed at explor- 
ing a new research direction, namely providing the designers of such protocols with 
portable platforms to experiment with alternative designs, in a generic, customizable 
environment, while providing tools for performance prohling, such as post-mortem 
analysis. We are convinced that many interesting ideas in DSM protocols could be 
more easily experimented using such an open platform: implementing everything from 
scratch is simply too hard! Also, such a platform enables competing protocol design- 
ers to compare their protocols within a common environment, using common prohling 
tools. Switching from one protocol to another, or switching from one communication 
library to another, can be done without changing anything to the application. No re- 
compiling is even needed if all the necessary routines have been linked beforehand. 
Finally, such a platform opens a large access to the area of co-design: indeed, the appli- 
cation and the protocol can then be designed and optimized together, instead of simply 
tuning the application on top of a hxed, existing protocol. This idea seems of partic- 
ular interest in the case of compilers targeting DSM libraries, as demonstrated by the 
Hyperion Java compiler project reported above. 
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Currently, DSM-PM2 is operational on Linux 2.2.x and Solaris 6 or later. Extensive 
testing has been done on top of SISCI/SCI, TCP/Myrinet and BIP/Myrinet. All the pro- 
tocols mentioned in Table Qare available and hybrid protocols mixing thread migration 
and page replication can also be built out of library functions. We are currently work- 
ing on a more thorough performance evaluation using the SPLASH-2 O benchmarks, 
which will be helpful to guide an efficient protocol use in applications. 
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Abstract. This paper presents the latest version of the SKiPPER skeleton-based 
parallel programming environment dedicated to fast prototyping of vision ap- 
plications. Compared to the previous version, its main innovative feature is the 
ability to handle arbitrary skeleton nesting. 



1 Introduction 

The goal of the SKiPPER project is to develop a parallel programming methodology 
allowing fast prototyping of vision applications on embedded and/or dedicated parallel 
platforms. This methodology should allow feld application programmers - as opposed 
to parallel programming specialists - to build parallel vision applications easily while 
preserving effeienc y. 

The SKiPPER-1 suite of tools, described for instance in 1^, 1 was the ffst 

instantiation of this methodology. It is however limited in terms of skeleton composi- 
tion. In particular, it cannot accommodate arbitrary skeleton nesting. The SKiPPER-II 
implementation described in this paper is an attempt to solve this problem. 

The paper is organised as follows. SectionQbrieiy recalls the skeleton-based paral- 
lel programming concepts in general and their incarnation in SKiPPER-I in particular. 
SectionQis a presentation of the concepts and techniques used in SKiPPER-II. Section 
Hgives some (very preliminary) results. SectionQis a short review of related work on 
skeleton nesting. And fiially sectionQconcludes the paper. 

2 Skeleton-Based Parallel Programming and SKiPPER-I 

Skeleton-based parallel programming methodologies provide a way for concili- 

ating fast prototyping and efftienc y. They aim at providing user guidance and a mostly 
automatic procedure for designing and implementing parallel applications. Eor that pur- 
pose, they provide a set of algorithmic skeletons, which are higher-order program con- 
structs encapsulating common and recurrent forms of parallelism to make them readily 
available for the application programmer. The latter does not need to take into account 
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low-level implementation details such as task partitioning and mapping, data distribu- 
tion, communication scheduling and load-balancing. The overall result is a signilfcant 
reduction of the design-implement- validate cycle time - especially on dedicated parallel 
platforms. 

Due to our interest in image processing, we have designed and implemented a 
skeleton-based parallel programming environment, called SKIPPER, based on a set of 
skeletons speciftally designed for parallel vision applications 0,QT3,EI1,O- This 
library of skeletons was designed from a retrospective analysis of existing parallel code. 
It includes four skeletons: 



- SCM, Split-Compute-Merge skeleton, 

- DF, Data Farming, 

- TF, Task Farming (a recursive version of the DF skeleton). 



- ITERMEM, Iteration with Memory, 

The SCM skeleton is devoted to regular, ’geometric” processing of iconic data, in 
which the input image is split into a fk ed number of sub-images, each sub-image is pro- 
cessed independently and the ffaal result is obtained by merging the results computed on 
sub-image (the sub-images may overlap and do not systematically cover the entire input 
image). This skeleton is applicable whenever the number of sub-images is lx ed and the 
amount of work on each sub-image is the same, resulting in an very even work-load. 
Typical examples include convolutions, median-fltering and histogram computation. 

The DF skeleton is a generic harness for so-called process farms. A process farm 
is a widely used construct for data-parallelism in which a farmer process has access 
to a pool of worker processes, each of them computing the same function. The farmer 
distributes items from an input list to workers and collects results back. The effect is 
to apply the function to every data item. We use a variant of this scheme in which the 
results collected by the farmer are accumulated using a speciffc user function instead of 
being just added to an output list. The DF skeleton shows its utility when the applica- 
tion requires the processing of irregular data, for instance an arbitrary list of windows 
of different sizes. In this case, a static allocation of computing processes to processors 
(like with the SCM skeleton) is not always possible and would result, anyway, to an un- 
even work-load between processors (which in turn results in a poor efftienc y). The DF 
skeleton handles this situation by having the farmer process directly doling out com- 
puting processes allocation to worker processes. Typically, the farmer starts by sending 
a packet to each worker, then waits for a result from a worker and immediately send 
another packet to him. This until no packets are left and the workers are no longer pro- 
cessing data. Each worker, on the other side, simply waits for a packet, processes it and 
returns the result to the farmer until it receives a stop condition from the farmer. This 
technique gives an inherent, primitive load-balancing. It is only effcient, however, if 
there are more data items than processors. 
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The TF skeleton may be viewed as a generalisation of the DF one, in which the 
processing of one data item by a worker may recursively generate new items to be 
processed. These data items are then returned to the farmer to be added to a queue 
from which tasks are doled out (hence the name task farming). A typical application 
of the TF skeleton is image segmentation using classical recursive divide-and-conquer 
algorithms. 

The ITERMEM skeleton does not properly speaking encapsulate parallel behaviour, 
but is used whenever the iterative nature of the real-time vision algorithms, i.e. the fact 
that they do not process single images but continuous streams of images, has to be 
made explicit. A typical situation is when computations on the n*^ image depend on 
results computed on the n-1*^ (or n-k*^). Such ’feedback” patterns are very common in 
tracking algorithms for instance, where a model of the system state is used to predict the 
position of the tracked objects in the next image. Another example is motion detection 
by frame-to-frame difference. The ITERMEM skeleton will always be the ’top-le vel” 
skeleton, taking as parameter either a sequential function or a composition of other 
skeletons. The latter case is the only situation in which SKiPPER-1 supports nested 
skeletons. 

Using SKiPPER, the application designer: 

- provides the source code of the sequential application-specift functions, 

- describes the parallel structure of his application in terms of composition of skele- 
ton chosen in the library. 

This description is made by using a subset of the CAML functional language, as 
shown on the Figure Q where a SCM skeleton is used to express the parallel com- 
putation of an histogram using a geometric partitioning of the input image. On this 
fgure, row_partition, seq_histo, merge_histo and display Jiisto are 
the application-specift sequential functions (written in C) and scm is the above men- 
tioned skeleton. This CAML program is the so-called skeletal program specification. 
In SKiPPER-I, it is turned into executable code by fist translating it into a graph of 
parametric process templates and then mapping this graph onto the target architecture. 

let img = read_img 512 512 ; ; 

let histo = scm 8 ro w p artition seq histo merge_histo img / ; 

let main = display_histo img histo ; ; 



Fig. 1. A ’’skeletal” program in CAML. 



Completely designed by the end of 1998, SKiPPER-1 has already been used for im- 
plementing several realistic parallel vision applications, such as connected component 
labelling IdSfl, vehicle tracking road tracking/reconstruction O' 

But the current version of SKiPPER does not support skeleton nesting, i.e. the ability 
for a skeleton to take another skeleton as argument. Arbitrary skeleton nesting raises 
challenging implementation issues as evidenced in ^3, ^3 >1^3 section 

Q. This paper speciftally deals with these issues by presenting a new implementation 
of the SKiPPER parallel programming environment (referred to as SKiPPER-II in the 
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sequel) supporting arbitrary nesting of skeletons. This implementation is based on a 
completely revised execution model for skeletons. Its three main features are: 

- the reformulation of all skeletons as instances of a very general one: a new version 
of the Task Farming skeleton (called TF/II), 

- a fully dynamic scheduling mechanism (scheduling was mainly static in SKiPPER- 

I), 

- a portable implementation of skeletons based on a MPI communication layer. 

3 SKiPPER-II 

One of the main features of the SKiPPER-I environment is that it relied on a mostly 
static execution model for skeletons. More precisely, most of the decisions regarding 
distribution of computations and scheduling of communications were made at compile- 
time by a third-party CAD software called SynDEx o. This implementation path, 
while resulting in very effcient distributed executives for ‘fetatic” - by static we mean 
that the distribution and scheduling of all communications does not depend on input 
data and can be predicted at compile-time - did not directly support ‘dynamic” skele- 
tons, in particular those based upon data or task farming (DF and TF). The intermediate 
representation of DF and TF was therefore rather awkward in SKiPPER-I, relying on 
ad-hoc auxiliary processes and synchronisation barriers to hide dynamically scheduled 
communications from the static scheduler [ y |. 

Another point about SKiPPER-1 is that the target executable code was MPMD: 
the fhal parallel C code took the form of a set of distinct main functions (one per 
processor), each containing direct calls to the application-specift sequential functions 
interleaved with communications. 

By contrast, execution of skeleton-based applications in SKiPPER-II is carried on 
by a single program (the ‘kernel” in the sequel) running in SPMD mode on all proces- 
sors and ensuring a fully dynamic distribution and scheduling of processes and commu- 
nications. The kernel’s work is to: 

- run the application by interpreting an intermediate description of the application 
obtained from the CAML program source, 

- emulate any skeletons of the previous version of SKiPPER, 

- manage resources (processors) for load-balancing when multiple skeletons must 
run simultaneously. 

In SKiPPER-II, the target executable code is therefore made up of the kernel and the 
application-specift sequential functions. In effect the kernel acts as a small (distributed) 
operating system that provides all routines the application needs to run on a processor 
network. 

The overall software architecture of the SKiPPER-Il programming environment is 
given on EigureQ the skeletal specification in CAML is analysed to produce the interme- 
diate description which is interpreted at run-time by the kernel; the sequential functions 
and the kernel code are compiled together to make the target executable code. These 
points will be detailed in the next sections. Note that fgure Q compared to the corre- 
sponding one given for SKiPPER-1 (in iBl for example), shows a signiftant reduction 
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in the number of implementation steps. This of course can be explained by the fact that 
many decisions that were taken at compile-time in the previous version are now taken 
dynamically (at run-time) by the kernel. 




Fig. 2. SKiPPER-II Environment. 



3.1 Intermediate Description 

Clearly, the validity of the ’kernel-based” approach presented above depends on the 
defiiition of an adequate intermediate description. It is the interpretation (at run-time) 
of this description by the kernel that will trigger the execution of the application- specifi 
sequential functions on the processors, according to the data dependencies encoded by 
the skeletons. 

A key point about SKiPPER-II is that, at this intermediate level of description, all 
skeletons have been turned into instances of a more general one called TF/II. The op- 
erational semantics of the TF/II skeleton is similar to the one of DF and TF: a master 
(farmer) process still doles out tasks to a pool of worker (slave) processes, but the tasks 
can now be different {i.e. each worker can compute a different function). 

The rationale for this ‘hniformisation” step is three-fold. 

- First, it makes skeleton composition easier, because the number of possible combi- 
nations now reduces to two (TF/II followed by TF/II or TF/II nested in TF/II). 

- Second, it greatly simplifes the structure of the run-time kernel, which only has to 
know how to run a TF/II skeleton. 

- Third, there is only one skeleton code to design and maintain, since all other skele- 
tons will be defiled in terms of this generic skeleton. 

The above-mentioned transformation is illustrated on FigureQwith a SCM skeleton. 
In this tgure white boxes represent pure sequential functions and grey boxes ’’support” 
processes (parameterised by sequential functions). Note that at the CAML level, the 
programmer still uses distinct skeletons (SCM, DF, TF,. . . ) when writing the skeletal 
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description of his application. The transformation is done by simply providing alterna- 
tive defhitions of the SCM, DF, TF,... higher-order functions in terms of the TF/II one. 
Skeleton composition is expressed by normal functional composition. The program ap- 
pearing in Figure0 for example, can be written like this in CAML: 

let nested x = scm 2 s2 f2 m2 x; ; 
let main x = scm 2 si nested ml x; ; 

The intermediate description itself - as interpreted by the kernel - is a tree of TF/II 
descriptors, where each node contains informations to identify the next skeleton and to 
retrieve the C function run by a worker process. Figure0shows an example of the corre- 
sponding data structure in the case of two nested SCM skeletons. In the current version, 
the intermediate description is written manually in the form of a C module linked with 
the kernel/application code before execution. It is intended to be automatically gener- 
ated from the CAML source program using a modifed version of the CAMLFLOW tool 
|1X| already used in SKiPPER-I. 




SCM TF/II 



I; input function S: split function F: compute function 
O: output function M: merge function 

Fig. 3. SCM — > TF/II transformation. 



3.2 Execution Model 

Within our fully dynamic execution model, skeletons are viewed as concurrent pro- 
cesses competing for resources on the processor network. 

When a skeleton needs to be run, and because any skeleton is now viewed as a TF/II 
instance, a kernel copy acts as the master process of the TF/II. This copy manages all 
data transfers between the master and the worker (slave) processes of the TF/II. Slave 
processes are located on resources allocated dynamically by the master. This way kernel 
copies interact to emulate skeleton behaviour. 

This illustrated on FigureQwith a small program showing two nested SCM skele- 
tons. This fgure shows the role of each kernel copy (two per processor in this case) in 
the execution of the intermediate description resulting from the the transformation of 
the SCM skeletons into TF/II ones. 
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Because any kernel copy knows when and where to start a new skeleton without 
asking information to copies, the scheduling of skeletons can be distributed. Indeed 
each copy of the kernel has its own copy of the intermediate description of the applica- 
tion. This means that each processor can start the necessary skeleton when it is needed 
because it knows which skeleton has stopped. A new skeleton is started whenever the 
previous one (in the intermediate description) ends. The next skeleton is always started 
on the processor which has run the previous skeleton (because this resource is supposed 
to be free and closer than the others!). 

Because we want to target dedicated and/or embedded platforms, the kernel was 
designed to work even if the computing nodes are not able to run more than one process 
at a time (no need for multitasking). 

Finally, in the case of a lack of resources the kernel is able to run some of the 
skeletons in a serial manner, including the whole application, thus providing a straight- 
forward sequential emulation facility for parallel programs. 

3.3 Management of Commnnications 

The communication layer is based on a reduced set of the MPI library functions (typ- 
ically MPLSSend or MPIJRecv), thus increasing the portability of skeleton-based ap- 
plications across different parallel platforms. We use only synchronous communication 
functions whereas asynchronous functions may perform much better in some cases, es- 
pecially when the platform has a specialised coprocessor for communicating and when 
communications and processing can overlap. This restriction is a consequence of our 
original experimental platform which did not support asynchronous communications. 

4 Preliminary Experiments 

Results presented here are preliminary ones, obtained with a prototype implementation 
of SKiPPER-II on a cluster of 8 Sun workstations (Sun Ultra 5) interconnected with a 
100 Mbits Ethernet network (effective bandwidth is about 8.5 Mbytes/s). 

The application used for the benchmark is a ’’synthetic” one. Even if it is too simple 
to reftct the complexity of real vision applications, it allows us to easily change its 
parameters to measure different features of interest. It is composed of two identical 
SCM skeletons nested in another one (it is very close to the one appearing in FigureQ. 
The outer SCM skeleton splits the input data (an array of N bytes) into two arrays of 
size N/2. Each inner SCM splits its input array into n/2 arrays, where n is the number 
of available machines (nodes). Each slave computes C/n fibating-point operations. The 
value of N and C may be freely adjusted. This application was also programmed using 
raw MPI calls, to provide a basis for comparison (in this version, the nested structure 
of the application has been fattened, i.e. we just make a single scatter, followed by a 
broadcast). For the SKiPPER-II version, we put two kernel copies per node. 

Results are given in Figures0Dand0for three conlgurations: 

- C =100MFlops, N=1 Mbytes, 

- C =100 MFlops, A^=10 Mbytes, 

- C =10MFlops, A^=10 Mbytes. 
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The FigureQexhibits a case in which a high compute/communication can in theory 
lead to almost linear speedups. Here the performances of SKiPPER-II and C/MPI are 
very close. The overhead of the former - mainly due to kernel - is less is less than 10 %. 
It can be explained by 

- the search for free resources by the kernel (involving a request to a centralised 

resource manager in the current implementation) 

- the need to run inner master processes. 

The infiience of higher communication costs on the relative performances of the 
two implementations is evidenced in FigureQ Here, the fact that SKiPPER-II performs 
more communications than raw MPI (because of the added communications between 
the outer and inner masters) results in a signiftantly higher (up to 50 %) overhead. 

This effect has been investigated further in Figure Q showing a very low com- 
pute/communication ratio. Despite the low parallel eftfcienc y of the application - evi- 
denced by the almost non-accelerated MPI version - it is encouraging that the behaviour 
of SKiPPER-II draws near to the one of the latter when the number of nodes increases. 

5 Related Work 

The P^L project has developed a fully-ftdged parallel programming lan- 

guage based on the concept of skeletons and associated implementation templates. A 
distinction is made between task parallel skeletons (’farm” and ’j3ipe’), data parallel 
skeletons (’hiap” and Teduce’) and control skeletons (’loop’). These skeletons are 
similar to ours and skeleton nesting is allowed in P^L. Sequential parts of a P^L ap- 
plication may be written in many sequential languages (C, Pascal, ...) and skeletons 
are introduced as special constructs using a C-like syntax. The P^L compilers gener- 
ate code for Transputer-based Meiko machines and for PVM running on a cluster of 
UNIX workstations. Recently Danelutto et al. |Q| have proposed an integration of the 
P^L skeletons within the CaML language, making program speciftations looking very 
similar to ours. Both P^L and OCamlP^L require either a good OS-level support (UNIX 
socket) or a generic message passing library (MPI) for their implementation. 

Moreover Danelutto et al. proposed a new implementation model for skeletons 0 
which is based on a distributed Macro Data Flow interpreter. This implementation 
model is close to ours: the skeleton code (which describes the application) is evalu- 
ated at run-time by an interpreter, and this interpreter is distributed. The difference lies 
in the intermediate representation: a data-fb w graph for [ f' I, a tree of TF/II skeletons 
for SKiPPER-II. 

Michaelson et al. at Edinburgh also developed a parallel programming system which 
allows skeleton nesting 13,0,0,0. They have signifcant experience in the ap- 
plication of functional programming to vision applications [Q, | TTSl . Skeletons are de- 
filed as higher-order functions in ML (as we do), and their latest defiiition of a vision- 
specitfc skeleton basis is very similar to ours (with comparable defhitions for the skele- 
tons). 

Their implementation, supporting dynamic execution of skeletons, relies on MPI 
process groups to allow concurrent access to computing resources when several skele- 
tons must simultaneously run. This approach requires a message passing layer with 
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routines to manage groups of processes. Our approach does not. Moreover, we never let 
the message passing layer allocates resources itself: all allocations are made internally 
by the kernel, allowing some filer decisions to be taken according to the global skeletal 
structure. 

Their work also differs from ours in two other points. First, their goal is an implicitly 
parallel system within which the decision of expanding a skeleton higher-order function 
into parallel constructs at the implementation level is taken by the compiler and not 
in the hand of the programmer. Second, no provision appear to be made for reusing 
sequential functions written in C (all the program is written in ML). 



6 Conclusion and Perspective 

The improvements realized in the new version of our SKiPPER parallel programming 
environment cover two essential aspects in skeleton implementation: 

- allowing skeleton composition, and especially the case of skeleton nesting, 

- relying on dynamic (runtime) scheduling of skeletons. 

The implementation model of skeletons was changed in a way that all skeletons can 
now be described using the same execution model. This was achieved by reformulating 
all skeletons as instances of a very general purpose one: the TF/II skeleton. This point 
offers possibilities in optimising the application since transformation rules could be 
applied to the TF/11 tree. 

Preliminary experiments show that skeleton nesting after conversion of all skeletons 
in TF/Il is easy to do and that, at least in the tested operational conditions, the cost 
penalty associated with the new implementation and execution models is not prohibitive 
(compared to hand-written MPI code). 

Work remains to be done to further assess the performances of SKiPPER-II on 
larger scale platforms. To do so, investigations are planned on a Cray T3E and large 
workstation clustersQ. 
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Support process 



User sequential function 



Intermediate description; 

3. Next skeleton = None 
Split function = S3 
Merge function = M3 
Slave function = F3 
Slave function type = User function 
Nested skeleton = None 

2. Next skeleton = None 
Split function = S2 
Merge function = M2 
Slave function = F2 
Slave function type = User function 
Nested skeleton = None 



1 . Next skeleton = 3 
Split function = SI 
Merge function = Ml 
Slave function = None 
Slave function type = Skeleton 
Nested skeleton = 2 



When ’slave function type’ is set to ’Skeleton’ 
then ’Nested skeleton’ field is used to know 
which skeleton must be used as a slave, that is 
to say which skeleton must be nested in. 



Fig. 4. Intermediate description data structure example. 



Skeleton 2: (TF/II) 
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Original user’s application graph 
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Fig. 5. Example of the execution of two SCM nested in one SCM. 
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2 SCM nested in 1 SCM (N=l Mbytes data, C=100 Mflops compute) 




Fig. 6. SKiPPER-II and C/MPI implementations comparison for 
C=100MFlops/N=l Mbytes. 



2 SCM nested in 1 SCM (N=10 Mbytes data, C=100 Mflops compute) 




Fig. 7. SKiPPER-II and 
C=100 MFlops/N=10 Mbytes. 



C/MPI implementations comparison for 






Completion time (s) 
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2 SCM nested in 1 SCM (N=10 Mbytes data, C=10 Mflops compute) 




Fig. 8. SKiPPER-II and C/MPI implementations comparison 
C=10 MFlops/N= 10 Mbytes. 
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Abstract. The Java platform has many characteristics which make it very desir- 
able for integrated continuous media processing. Unfortunately, it lacks the nec- 
essary CPU resource management facility to support Quality of Service 
guarantees for soft real-time multimedia tasks. In this paper, we present our new 
Java Virtual Machine, Q-JVM, which brings CPU resource management to the 
Java platform. Q-JVM is based on Sun’s reference implementation. It incorpo- 
rates an enhanced version of the MTR-LS algorithm in its thread scheduler. 
Combined with an optional admission control mechanism, this algorithm is able 
to support QoS parameters such as fairness, bandwidth partitioning and delay 
bound guarantees, as well as the cumulative service guarantee. Preliminary 
experimental results show that Q-JVM is backward compatible with the stan- 
dard version from Sun, has low scheduling overhead, and is able to provide QoS 
guarantees as specified. 



1 Introduction 

Integrated continuous media processing presents unique challenges to the underlying 
computational environment: it imposes real-time requirements on the host operating 
system and its subsystems, as continuous media data must be presented continuously 
in time at a predetermined rate in order to convey their meaning. However, software 
that process continuous media data is often classified as being soft real-time, as miss- 
ing a particular deadline is not fatal as long as it is not missed by too much and most 
other deadlines are not missed. 

The Java platform [1] has many desirable characteristics for integrated multime- 
dia processing. Java is a simple and small language. It is object-oriented and supports 
many language features such as interfaces and automatic memory management, which 
make it a robust environment for software development. It also supports multithread- 
ing at the language level with built-in synchronization primitives, thus allowing a high 
degree of interactivity with the end user. Moreover, Java has a rich collection of appli- 
cation programming interfaces which support media manipulation and continuous 
media processing. Most importantly, Java was designed for embedded applications, 
and is ideal for media-capable integrated devices of the future. 

F. Mueller (Ed.): HIPS 2001, LNCS 2026, pp . 86-94, 2001. 

© Springer-Verlag Berlin Heidelberg 1998 
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Unfortunately, Java does not have the facility to support soft real-time tasks. Real- 
time programming is a matter of managing resources, most notably the CPU resource. 
However, Java does not provide any mechanism which can be used to monitor, man- 
age, or police the usage of the CPU resource. 

In this paper, we present a new Java Virtual Machine with QoS features, the Q- 
JVM, based on Sun’s JVM. Its purpose is to support integrated continuous media pro- 
cessing on mobile or embedded devices, such as PDAs and TV set-top boxes, where 
soft real-time guarantees must be provided with limited system resources. 

Our new JVM employs an enhanced version of the Move-To-Rear List Schedul- 
ing algorithm [2] in its thread scheduler. The service fraction-based scheduling algo- 
rithm is enhanced to handle not only user threads, but also system threads which must 
express their urgency using priorities. 

Preliminary test results have shown that Q-JVM has low scheduling overhead, 
and is able to provide QoS guarantees as specified. Moreover, it is binary compatible 
with the standard JVM distributed by Sun. 

The rest of this paper is organized as follows. Section 2 discusses related research 
in resource and Quality of Service management. Section 3 describes our implementa- 
tion of Q-JVM which supports CPU resource management and some of our experi- 
ments with this platform. Finally, Section 4 briefly summarizes this paper. 

2 Previous Work 

There has been a lot of research in CPU resource management for soft real-time appli- 
cations. Some researchers have studied existing systems and algorithms that claim to 
support real-time multimedia applications [7]. At the same time others like [5] have 
borrowed ideas from link scheduling, or have proposed new algorithms [2] for manag- 
ing the CPU resource. 

2.1 Static Priority Based Scheduling 

The most common CPU resource management schemes employ static priority-based 
scheduling. In such a scheme, all execution entities are assigned fixed priorities. The 
scheduler rations the CPU among competing scheduling entities according to their pri- 
orities. However, an extensive quantitative analysis of such a scheduler, as imple- 
mented in UNIX System V Release 4, demonstrated that it is largely ineffective in 
dealing with soft real-time tasks [7]. It could even produce system lockup. 

Similarly, hard real-time scheduling algorithms, such as Earliest Deadline First 
and the Rate Monotonic Algorithm, are also not suitable for integrated continuous 
media processing [7]. This class of algorithms either fails to achieve the desired effi- 
ciency for integrated computing environments, or requires prior analysis of computa- 
tional requirements of the particular application mix. The latter is difficult, if not 
impossible, for dynamic systems. 
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2.2 Resource Management Based on Fair Queuing 

Network bandwidth resource management has long been a hot research topic. A large 
body of work exists in the scheduling of network flows to network links. Many 
researchers have noticed the parallel between CPU scheduling and network link sched- 
uling. Thus, it is not surprising that many link scheduling algorithms have been 
adopted for CPU scheduling. Among these, the Start-time Fair Queuing (SFQ) algo- 
rithm [5] is a notable example. 

SFQ improves upon early fair queuing algorithms like Weighted Fair Queuing [3] 
and Self Clocked Fair Queuing [4] by removing their requirement for prior knowledge 
of the computational needs of competing tasks; it is also much more efficient than 
algorithms like Fair Queuing based on Start time [5]. Moreover, it handles fluctuation 
in available bandwidth due to sporadic interrupt processing better than other fair queu- 
ing algorithms [5]. Unfortunately, its delay bound increases linearly with the number 
of threads in the system [8], thus making it undesirable for complex and dynamic sys- 
tems. 

2.3 The MTR-LS Algorithm 

The Move-To-Rear List Scheduling algorithm (MTR-LS) is a resource scheduling 
algorithm which was developed at the Bell Laboratories to provide predictable service 
in a general purpose system with multiple resources including CPU, disk, and a net- 
work [2]. Besides the usual Quality of Service parameters such as fairness, bandwidth 
partitioning and delay bound, it also supports a new criterion called cumulative service 
guarantee: MTR-LS guarantees that the real service obtained by a process, given its 
specified service rate, on a shared server does not fall behind the ideal service it would 
have accumulated on a dedicated server at the same service rate, by more than a con- 
stant amount. 

Scheduling using MTR-LS is based on service fractions. A service fraction is a 
fraction assigned to a scheduling entity that represents the service rate of a virtual 
server in terms of the real server. A system constant, named the virtual time quantum 
T, is used to specify the total target time for servicing each and every active scheduling 
entity in the system exactly once. 

Each scheduling entity is also assigned a time stamp and a quantum left when it 
requests service. The quantum size is calculated as the product of its service fraction 
and T. All active scheduling entities are kept in the service list, and are sorted by their 
time stamps. Entities with earlier time stamps appear closer to the front of the list. The 
MTR-LS algorithm always schedules the first runnable entity on the service list. 

A scheduled entity is preempted when its quantum expires or when some other 
entity whose position on the service list is ahead of it becomes runnable. After the pre- 
emption, the time it has been serviced on the server is subtracted from its quantum left, 
to yield an updated value of left. If the result is zero, then it is assigned a new time 
stamp and its quantum is re-initialized. Its position on the service list is adjusted 
according to its new time stamp; i.e., it is moved to the rear of the list. 
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The MTR-LS algorithm provides bandwidth partitioning and fairness guarantees 
for all competing entities with respect to their service fraction allocations. With admis- 
sion control, it is also able to provide delay bounds and cumulative service guarantees 
[2]. Moreover, it is very efficient: even with a straightforward implementation, the 
computational complexity of the algorithm is 0{ln{n)) where n is the number of entries 
in the service list [2]. These properties persuaded us to build on the MTR-LS algorithm 
to provide Quality of Service guarantees for Java threads. 

3 Q-JVM 

Our enhanced version of JVM is based on Sun’s reference implementation of the Java 
virtual machine. The JVM source code obtained from Sun supports both the Windows 
and Solaris platforms; however, we chose to develop our Q-JVM on Solaris. 

Java threads may be mapped on to native operating system scheduling entities 
using One-to-One, Many-to-One or Many-to-Many models [6]. With a 1-to-l mapping, 
each Java thread is supported by its own scheduling entity known to the operating sys- 
tem. Scheduling is handled by the OS kernel; all threads have equal access to the ker- 
nel at the same time. Thus, this model is able to exploit any hardware parallelism that 
may be present. 

With the m-to-1 model, all Java threads are mapped onto the same scheduling 
entity supported by the OS, and scheduling of Java threads is handled by a user-level 
threads library. Only one scheduling entity is known to the operating system, and only 
one thread can access the kernel at any given time. This model does not exploit hard- 
ware parallelism; however, it has the advantage that the OS kernel is not required to 
support multithreading. Moreover, it is also very efficient, as all scheduling decisions 
and context switches can be handled in user space, without kernel intervention. This is 
a considerable advantage in uni-processor environments where additional expensive 
context switches to and from the kernel does not yield any benefit in terms of 
increased parallelism. It is thus ideal for mobile and embedded devices that do not 
have multiple processors, and where CPU bandwidth is at a premium. 

The m-to-n mapping model is the most elaborate. It uses a user-level threads 
library in conjunction with an OS kernel that supports multi-threading: Java threads 
are mapped onto a pool of scheduling entities known to the kernel. The threads library 
manages the pool of scheduling entities and the mapping between Java threads and 
kernel scheduling entities, while the kernel schedules only the entities known to it. 

On Solaris, one has the option of using either the Many-to-Many model with the 
Solaris native thread library, or the Many-to-One model with the Green Thread library. 
Since it is not common to have true hardware parallelism on our target platform, we 
decided to base our changes on Green Thread using the m-to-1 model. 

Our approach is to build the resource management facility into the lowest level of 
the JVM, in this case the Green Thread library, so that resource consumption by all 
threads, including the threads spawned by the JVM and its native libraries, can be 
managed effectively. 
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2.2 Resource Management Based on Fair Queuing 

Network bandwidth resource management has long been a hot research topic. A large 
body of work exists in the scheduling of network flows to network links. Many 
researchers have noticed the parallel between CPU scheduling and network link sched- 
uling. Thus, it is not surprising that many link scheduling algorithms have been 
adopted for CPU scheduling. Among these, the Start-time Fair Queuing (SFQ) algo- 
rithm [5] is a notable example. 

SFQ improves upon early fair queuing algorithms like Weighted Fair Queuing [3] 
and Self Clocked Fair Queuing [4] by removing their requirement for prior knowledge 
of the computational needs of competing tasks; it is also much more efficient than 
algorithms like Fair Queuing based on Start time [5]. Moreover, it handles fluctuation 
in available bandwidth due to sporadic interrupt processing better than other fair queu- 
ing algorithms [5]. Unfortunately, its delay bound increases linearly with the number 
of threads in the system [8], thus making it undesirable for complex and dynamic sys- 
tems. 

2.3 The MTR-LS Algorithm 

The Move-To-Rear List Scheduling algorithm (MTR-LS) is a resource scheduling 
algorithm which was developed at the Bell Laboratories to provide predictable service 
in a general purpose system with multiple resources including CPU, disk, and a net- 
work [2]. Besides the usual Quality of Service parameters such as fairness, bandwidth 
partitioning and delay bound, it also supports a new criterion called cumulative service 
guarantee: MTR-LS guarantees that the real service obtained by a process, given its 
specified service rate, on a shared server does not fall behind the ideal service it would 
have accumulated on a dedicated server at the same service rate, by more than a con- 
stant amount. 

Scheduling using MTR-LS is based on service fractions. A service fraction is a 
fraction assigned to a scheduling entity that represents the service rate of a virtual 
server in terms of the real server. A system constant, named the virtual time quantum 
T, is used to specify the total target time for servicing each and every active scheduling 
entity in the system exactly once. 

Each scheduling entity is also assigned a time stamp and a quantum left when it 
requests service. The quantum size is calculated as the product of its service fraction 
and T. All active scheduling entities are kept in the service list, and are sorted by their 
time stamps. Entities with earlier time stamps appear closer to the front of the list. The 
MTR-LS algorithm always schedules the first runnable entity on the service list. 

A scheduled entity is preempted when its quantum expires or when some other 
entity whose position on the service list is ahead of it becomes runnable. After the pre- 
emption, the time it has been serviced on the server is subtracted from its quantum left, 
to yield an updated value of left. If the result is zero, then it is assigned a new time 
stamp and its quantum is re-initialized. Its position on the service list is adjusted 
according to its new time stamp; i.e., it is moved to the rear of the list. 
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When invoked, a new Green Thread scheduler fetches the thread with the earliest 
time stamp from the runnable queue and checks its left value. If this value is less than 
or equal to zero, it moves this thread to the rear of the list by assigning it a new time 
stamp and reinserting it into the runnable queue. The left of this thread is re-initiated to 
be a • r + left , where a is the service fraction of this thread, T is the virtual time 
quantum, and left is this thread’s last left value. The scheduler function repeats this 
operation until it finds a thread with a positive left value. It then invokes the context 
switching code to record the start time of this quantum and transfer the control of CPU 
to the scheduled thread. 

3.2 Extension to the MTR-LS Algorithm 

The MTR-LS algorithm is used in Q-JVM to schedule not only user threads, but also 
system threads that must express their urgency using priorities. For example, the 
thread scheduler must recognize that the ClockHandler thread must be scheduled as 
soon as it is runnable and the idler thread should only be scheduled when there are no 
other runnable threads. Unfortunately, MTR-LS is based on service fractions and CPU 
resource consumption. It does not have an inherent notion of urgency, which may be 
expressed with priorities. 

Our solution is to extend the MTR-LS algorithm to handle the system threads as a 
special case. We took advantage of the fact that MTR-LS schedules threads according 
to their positions on the service list L, and that the time stamps which determine the 
positions of threads on L can serve as effective priorities. 

Under the MTR-LS algorithm, a thread will not be scheduled if there are other 
threads ahead of it on the service list. Hence, if high priority system threads are placed 
at the beginning of the list ahead of all user threads, then as soon as they become run- 
nable they will be scheduled. Similarly, if the low priority system threads are placed at 
the rear of the list after all user threads, then they will not be scheduled until there are 
no user threads that are runnable. 




Figure 2. Positions of the System Threads and User Threads on the Service List L. 

A map of positions of the system threads and user threads on the service list L is 
shown in Figure 2. The ClockHandler thread and the TimeSlicer thread are assigned 
the two earliest (largest) time stamps. This puts them at the beginning of the service 
list. At the same time, the idler thread is assigned an artificially late (small) time stamp 
which puts it near the end of the service list'. Moreover, all system threads are tagged 
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so that when context is switched away from them, their left values are not updated. As 
a result, their positions on L are stationary. 

This approach takes full advantage of the natural ordering of threads on the ser- 
vice list and the parallel between time stamps and global priorities. It provides a simple 
and straightforward solution to the problem of using the MTR-LS algorithm in a root 
scheduler which must allow system threads to express their urgency. 

3.3 High Level API 

The enhanced capabilities of Q-JVM are made available to the application layer via a 
high level API consisting of two Java classes: the QThread class and the QThread- 
Group class. These two classes supersede the java. lang. Thread class and the 
j ava . lang . ThreadGroup class, respectively. 

The QThread class provides CPU resource management capabilities to Java 
threads. It supports all normal threads operations such as thread creation, termination 
and communication. At the same time, it supports specification of resource require- 
ments in terms of service fraction reservations. The QThreadGroup class provides 
support for CPU bandwidth partitioning for groups of threads. 

The j ava . lang . Thread class and the j ava . lang . ThreadGroup class were 
also re-implemented to provide compatibility. These classes adapt the classic Thread 
API to Q-JVM in order to support unmodified legacy Java programs. 

3.4 Experiments and Results 

The preliminary performance evaluation was done on a Sun SPARCserver 20 with a 
60MHz Ultra-SPARC CPU and 64 MB of main memory running Solaris 2.4. The 
experiments were conducted in multi-user mode with the standard complement of dae- 
mons like Ipd, sendmail, NFS, and a very lightly loaded FITTP server. Moreover, all 
experiments were conducted when there was no interactive user activities. 

Most of the experiments were carried out using the RaceTest test suite. This test 
suite is a multithreaded Java program that simulates a CPU intensive application. It is 
modeled after the well known Dhrystone benchmark. When the application starts, it 
spawns a number of threads, called runners, each of which executes an arithmetic cal- 
culation repeatedly in a loop. The number of loops completed in a given time by one or 
more runners is used as the performance metric. 

Two versions of the test suite were constructed. The first one runs on the enhanced 
JVM and allows an operator to specify the number of runners to start and the service 
fraction for each runner. The other version runs on the standard JVM. Instead of ser- 
vice fractions, it allows an operator to specify priorities for the runners. Otherwise, the 
two versions are identical. A standard Java virtual machine was built from the un-mod- 
ified JVM source from Sun to run the standard version of RaceTest. 

1 . The last few positions on L are reserved for implementing a JVM system function which sus- 
pends all user threads before the garbage collector is invoked. The original implementation 
sets the priorities of all user threads to -1. Our implementation moves the user threads to a 
position on L that is behind the idler thread. 
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A major concern in using a complex resource management scheme such as the 
one built into our enhanced Java virtual machine is that the scheduling overhead may 
be high. To evaluate this overhead, we compared the aggregate average throughput 
achieved by the Race Test application on the enhanced platform versus the standard 
JVM. The results are presented in Table 1 below. 



Table 1: Effect of Scheduling Overhead on Thread Throughput 





1 Runner 


4 Runners 


\Q Runners 


Standard JVM 


602,797 


602,084 


601,040 


Q-JVM 

Configuration 1 


601,399 


599,860 


599,1 U 


Q-JVM 

Configuration 2 


601,582 


600,939 


599,882 



All reported figures in the above table are Aggregate Average Throughput in 
loops completed per second. This figure is arrived at as follows. One thousand 
throughput samples are taken for each runner. If there is only one thread, the average is 
reported. When there is more than one runner, the sum of their averages is reported. 

On the standard JVM, all threads (runners) are started with priority 5. On the 
enhanced JVM, all threads are assigned 1.5% of total CPU bandwidth in configuration 
1. In configuration 2, all threads are assigned equal service fractions totaling 80%. For 
example, when there are 10 threads, each is assigned a service fraction of 8%. 

From Table 1 we can observe that the throughput differences among all figures are 
very small: throughput of the application running on Q-JVM is only 0.19% and 0.36% 
lower than the same application running on the standard JVM. As the application is 
CPU intensive, and does not involve any I/O or kernel service call, this result indicates 
that the additional scheduling overhead of the new platform is very small compared to 
the standard Java virtual machine. 

With the same application and similar test procedures, we also demonstrated that 
Q-JVM is able to provide predictable resource allocation and resource partitioning 
with service fraction specifications. Table 2 shows result from one test run with 2 run- 
ners. One is assigned service fractions of 70% and the other 20%; their respective 
throughputs are 73% and 26% of the total throughput. . 



Table 2: Resource Allocation and Partitioning. 





Service Fraction 


Throughput (% of total) 


Runner 1 


70% 


442,728 (74%) 


Runner 2 


20% 


158,069 (26%) 



In addition to the RaceTest application, a number of standard test suites and appli- 
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cation such as the CqffeineMark suite, the Java Media Framework, and the HotJava 
browser were successfully executed on the Q-JVM without any modifications. The 
results indicated that the new platform is binary compatible with the standard JVM 
distributed by Sun. 

4 Summary 

In this paper, we presented our work on supporting soft real-time tasks on the Java 
platform. We developed a new JVM that is able to support resource allocation and reg- 
ulation of consumption of the CPU resource. 

We extended a service fraction-based resource management algorithm, the Move- 
to-Rear List-Scheduling algorithm, to handle system threads that must express their 
urgency, normally expressed using priorities. We then incorporated this algorithm into 
a new JVM based on the source code of the JVM licensed from Sun. 

The result of our work is a new Java platform that is able to support soft real-time 
tasks. It provides mechanisms for resource allocation and management, as well as 
guarantees for a number of quality of service parameters like fairness, bandwidth par- 
titioning and cumulative service. 

Preliminary test results indicate that our scheme for resource accounting and man- 
agement is viable and the new Q-JVM is indeed able to provide these QoS guarantees 
as specified. Yet, its scheduling overhead is similar to that of the standard version. 
Moreover, our JVM is binary compatible with the standard JVM distributed by Sun. 

References 

1. K. Arnold and J. Gosling. The Java Programming Language, Second Edition. Addision- 
Wesley, Reading, Massachusetts. 1998. 

2. J. Bruno, E. Gabber, B. Ozden, and A. Silberschatz. Move-To-Rear List Scheduling: a new 
scheduling algorithm for providing QoS guarantees. In Proceedings of The Fifth ACM 
International Multimedia Conference, pp. 63-73, November 9-13, 1997. 

3. A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Eair Queueing 
Algorithm. In Proceedings of ACM SIGCOMM, pp. 1-12, September 1989. 

4. S. J. Golestani. A Self-Clocked Fair Queueing Scheme for Broadband Applications. In 
Proceedings of the 13th Annual Joint Conference of the IEEE Computer and 
Communications Societies on Networking for Global Communication. 2:636-646, IEEE 
Computer Society Press, June 1994. 

5. P. Goyal, X. Guo, and H.M. Vin. A Hierarchical CPU Scheduler for Multimedia Operating 
Systems. In Proceedings of the Second USENIX Symposium on Operating System Design 
and Implementation, pp. 107-121, October 1996. 

6. JavaSott. Multithreaded implementation and Comparisons, A W rite Paper. Available at 
nnp://soiaris.]avasojt.com/aeveioper/news/wniiepapers/miwp.nimi. Part No.: 96168-001. 
JavaSoft. April 1996. 

7. J. Nieh et ah, “SVR4 UNIX Scheduler Unacceptable for Multimedia Applications”, 
Lecture Notes in Computer Science, Vol. 846, D. Shepherd et al. (eds.). Springer Verlag, 
Heidelberg, Germany, 1994, pp. 41-53. 

8. I. Stoica et. al., A Proportional Share Resource Allocation Algorithm for Real-Time, Time- 
Shared Systems. In Proceedings of the IEEE Real Time Systems Symposium, Dec. 1996. 




Evaluating the XMT Parallel Programming Model 



Dorit Naishlos' *, Joseph Nuzman^ ^ *, Chau-Wen Tseng' and Uzi Vishkin^ ^ 4 * 

' Dept of Computer Science, University of Maryland, College Park, MD 20742 
^ Dept of Electrical and Computer Engineering, University of Maryland 
College Park, MD 20742 

' University of Maryland Institute of Advanced Computer Studies, College Park, MD 20742 
4 Dept of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel 
{ dorit , jnuzman, tseng, vishkin}@cs . umd . edu 



Abstract. Explicit-multithreading (XMT) is a parallel programming model de- 
signed for exploiting on-chip parallelism. Its features include a simple thread 
execution model and an efficient prefix-sum instruction for synchronizing 
shared data accesses. By taking advantage of low-overhead parallel threads and 
high on-chip memory bandwidth, the XMT model tries to reduce the burden on 
programmers by obviating the need for explicit task assignment and thread 
coarsening. This paper presents features of the XMT programming model, and 
evaluates their utility through experiments on a prototype XMT compiler and 
architecture simulator. We find the lack of explicit task assignment has slight 
effects on performance for the XMT architecture. Despite low thread overhead, 
thread coarsening is still necessary to some extent, but can usually be automati- 
cally applied by the XMT compiler. The prefix-sum instruction provides more 
scalable synchronization than traditional locks, and the simple run-until- 
completion thread execution model (no busy-waits) does not impair perform- 
ance. Finally, the combination of features in XMT can encourage simpler paral- 
lel algorithms that may be more efficient than more traditional complex ap- 
proaches. 



1 Introduction 

When discussing parallel programming models, the parallel computing community 
usually considers two models: message-passing and shared-memory. Both models 
usually require domain partitioning and load balancing. For dynamic, adaptive appli- 
cations this effort can amount to 25% of the entire code and become a significant 
source of overhead [9], [12]. Message-passing in addition requires distributing data 
structures across processors and explicitly handling inter-processor communication. 
Performance also decreases for fine-grained parallelism under both models, as the 
effects of synchronization and communication overhead become a bigger factor. 

Many of these issues, however, are of lesser importance for exploiting on-chip par- 
allelism, where parallelism overhead is low and memory bandwidth is high. This ob- 
servation motivated the development of the Explicit Multi-threading (XMT) pro- 
gramming model. XMT is intended to provide a parallel programming model which is 
simpler to use, yet efficiently exploits on-chip parallelism. 
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Previous papers on XMT have discussed in detail its fine-grained SPMD multi- 
threaded programming model, architectural support for concurrently executing multi- 
ple contexts on-chip, and preliminary evaluation of several parallel algorithms using 
hand-coded assembly programs [14] [6]. A more recent paper describes the prototype 
XMT programming environment, including the XMT compiler and simulator [11]. In 
this paper, we describe features of the XMT programming model that were designed 
for exploiting on-chip parallelism, and evaluate their impact on both programmability 
and performance using the XMT programming environment. 

The main contributions of this paper are as follows: 

■ We discuss and evaluate features of XMT designed to exploit on-chip parallelism. 

■ We examine their effect on programmability for several interesting application 
kernels. 

■ We experimentally evaluate the impact of XMT features on performance using 
the XMT compiler and simulator. 

We begin by reviewing the XMT multi-threaded programming model. We then 
briefly discuss the XMT architecture and environment, including a compiler and be- 
havioral simulator. We examine the impact of each feature of XMT on programmabil- 
ity and performance. Finally, we present a comparison with related work and con- 
clude. 



2 XMT Programming Model 

The basic premise behind XMT is that instead of forcing the hardware to find instruc- 
tion-level parallelism at run-time, the instruction set architecture should provide pro- 
grammers (or the compiler) with the ability to explicitly specify parallelism when it is 
available. 

The XMT architecture is specifically designed to exploit parallelism in an on-chip 
environment, in a manner that directly affects the way programs are written in XMT. 
The XMT architecture attempts to provide more uniform memory access latencies, 
taking advantage of faster on-chip communication times. In addition, a specialized 
hardware primitive (prefix-sum), exploits the high on-chip communication bandwidth 
to provide low overhead thread creation. These low overheads allow to efficiently 
support fine-grained parallelism. Fine granularity is in turn used to hide memory la- 
tencies which, in addition to the more uniform memory accesses, supports a program- 
ming model where locality is less of an issue. The XMT hardware also supports dy- 
namic load balancing, relieving the programmers of the task of assigning work to 
processors. The programming model is simplified further by letting threads always run 
to completion without synchronization (no busy-waits), and synchronizing accesses to 
shared data with a prefix-sum instruction. All these features result in a flexible pro- 
gramming style, that encourages the development of new algorithms, and as such, is 
expected to target a wider range of applications. 

The programming model underlying the XMT framework is an arbitrary CRCW 
(concurrent read concurrent write) SPMD (single program multiple data) program- 
ming model. In the XMT programming model, an arbitrary number of virtual threads. 
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initiated by a spawn and terminated by a join, share the same code. At run-time, dif- 
ferent threads may have different lengths, based on control flow decisions made at run 
time. The arbitrary CRCW aspect dictates that concurrent writes to the same memory 
location result in an arbitrary one committing. No assumption can be made before- 
hand about which will succeed. This permits each thread to progress at its own speed 
from its initiating spawn to its terminating join, without ever having to wait for other 
threads; that is, no thread busy-waits for another thread. An advantage of using this 
SPMD model is that it is an extension of the classical PRAM model, for which a vast 
body of parallel algorithms is available in the literature. 

The user-level XMT language is an extension of standard C. The following exam- 
ple XMT program copies all non-zero values of array ,4 to S in an arbitrary order: 

m = 0 ; 
spawn (n, 0) / 

{ 

int TID; 

if (A[TID] != 0) { 

int k = ps (&m, 1) ; 

B [k] = A [TID] ; 



j oin { ) ; 



The programming model has a number of key features: 

■ Explicit spawn-join parallel regions. 

■ Shared accesses synchronized with prefix-sum instruction. 

■ Threads run to completion (do not busy- wait). 

■ Dynamic forking of additional virtual threads. 

A parallel region is delineated by spawn and join statements. Every thread execut- 
ing the parallel code is assigned a unique thread ID, designated TID. Shared accesses 
are synchronized with a prefix-sum instruction (ps), similar to an atomic fetch-and- 
increment. It can be combined by the hardware to form a multi-operand prefix-sum 
operation. XMT does not allow for nested initiation of a spawn within a parallel 
spawn region [14], but a thread can perform a fork operation to introduce a new vir- 
tual thread as work is discovered [15]. 

3 XMT Architecture 

The XMT programming model allows programmers to specify an arbitrary degree of 
parallelism in their code. Clearly, real hardware has finite execution resources so all 
threads can’t run simultaneously. In an XMT machine, a thread control unit (TCU) 
executes an individual virtual thread. Upon termination, the TCU performs a prefix- 
sum operation in order to receive a new thread ID. The TCU will then emulate the 
thread with that ID. All TCUs repeat the process until all the virtual threads have 
been completed. This functionality is enabled by support at the instruction set level. 
With our architecture, all TCUs independently execute a serial program. Each accepts 
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the standard MIPS instructions, and possesses a standard set of MIPS registers locally. 
The expanded ISA includes a set of specialized global registers, called prefix-sum 
registers (PR), and a few additional instructions. 

New instructions were added for thread management. A spawn instruction inter- 
rupts all TCUs and broadcasts a new PC at which all TCUs will start. The pine in- 
struction operates on the PR registers and performs a parallel prefix-sum with value 1 . 
A specialized global prefix-sum unit can handle multiple pines to the same PR register 
in parallel. Simultaneous pines from different TCUs are grouped and the prefix-sum is 
computed and broadcast back to the TCUs. This process is pipelined, and completes 
within a constant number of cycles, pread performs a parallel read (prefix-sum with 
value 0) of a PR register, and pset is used (serially) to initialize a PR register. 

The psm instruction allows for communication and synchronization between 
threads. It performs a prefix-sum operation with an arbitrary increment to any loca- 
tion in memory. It is an atomic operation, but due to hardware limitations, is not per- 
formed in parallel (i.e., concurrent psm’s will be queued). This is equivalent to a 
fetch-and-increment. A ps command at the XMT-C level is translated to a psm in- 
struction. 

Additional instructions exist to support the nested forking mechanism. Forks from 
many TCUs can be performed in parallel batches. 

4 XMT Environment 

The XMT environment consists of a prototype XMT compiler and behavioral simula- 
tor. The XMT compiler consists of two passes. The front end is a translator based on 
the SUIF compiler system [17] that converts XMT constructs into regular C code with 
assembly templates. It also detects all parallel regions delineated by spawn-join state- 
ments and transforms them into parallel function calls. The back end is based on 
gnu’s gee and builds an executable for the C code output by the front end [11]. 

The XMT behavioral simulator is comparable to SimpleScalar [2]. The fundamen- 
tal units of execution for the simulated machine are the multiple thread control units 
(TCUs), each of which contains a separate execution context. In hardware, an indi- 
vidual TCU basically consists of the fetch and decode stages of a simple pipelined 
processor. To increase resource utilization and to hide latencies, sets of TCUs are 
grouped together to form a cluster. The TCUs in a cluster share a common pool of 
functional units, as well as memory access and prefix-sum resources. The clusters can 
be replicated repeatedly on a given chip [3]. 

For our experiments, we specify 8 TCUs in each cluster. Each cluster contains 4 
integer ALUs, 2 integer multiply/divide units, 2 floating point ALUs, 2 floating point 
multiply/divide units, and 2 branch units. All functional unit latencies are set to the 
SimpleScalar sim-outorder defaults: 1 cycle for integer ALU ops, 3 cycles for integer 
multiply, 20 cycles for integer divide, 2 cycles for floating point ALU ops, 4 cycles 
for floating point multiply, 12 cycles for floating point divide, and 24 cycles for 
square root. Each cluster has a LI cache of 8 KB, and a shared, banked L2 cache of 1 
MB. The number of banks is chosen to be twice the number of clusters. The L2 cache 
latency is 6 cycles and memory latency 25 cycles. A penalty of 4 cycles is charged 
each way for inter-cluster communication. 
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5 XMT Programming and Performance 

Traditional shared memory programming consists of assigning chunks of work to 
processes, usually as coarse-grained as possible, while locks and barriers are used for 
synchronization. The following are the main programming concepts that distinguish 
XMT from traditional parallel programming: 

1 . No task assignment. 

2. Fine-grained parallelism. 

3. No busy- wait. 

In this section we examine each of the above features, and how they affect the way 
programs are written in XMT. Furthermore, we study the effects of these features on 
performance. 

5.1 No Task Assignment 

XMT relieves the programmer from the task of assigning work to processors. The 
programmer can think in terms of virtual threads, without worrying about low-level 
considerations such as the number of processing units and load balance between them. 
Instead of directly spawning a thread for each block of work, traditional parallel pro- 
gramming puts a lot of effort into the task of grouping blocks of work together in a 
good way (with respect to load-balance and locality). This effort is translated to pro- 
grammer time, line count, and code complexity. Sometimes, a program benefits from 
grouping because it may allow saving duplicate work. If the blocks of work are very 
small, this grouping serves as thread coarsening. However, in some cases, by incurring 
extra work, it can even result in a less efficient program. 

We evaluate the effects of task composition on each of three programs (mmult, 
convolution and dag) by comparing two versions - XMT and Traditional. The tradi- 
tional version always spawns exactly one thread for each TCU, and a loop is added to 
the thread body to span through all the blocks of work that are assigned to the thread 
(Table 1). 

Table 1. Effects of task decomposition on programming of parallel dot product computation. 



XMT 


Traditional 


spawn (N*N, 0 ) / 


spawn ( tcus , 0 ) ; 


{ 


{ 

lb =N*N*TID/tcus; 
ub =N*N* (TID+1) /tcus; 
for (m=lb;m<ub;m++) { 


X = C [0] [TID] ; 


X = C [0] [m] ; 


i = TID/N; j = TID%N; 


I = m/N; j = m%N; 


for (k=0; k<N; k++) { 


for (k=0 ;k<N;k++) { 


X += A[i] [k] *B[k] [j] ; 

1 


X += A[i] [k]*B[k] [j] ; 

1 


/ 

C[0] [TID] = X; 


1 

C [0] [m] = X; 

} 


} 

j oin ( ) ; 


} 

j oin ( ) ; 
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Note that even when the entire task of assignment involves only the few source 
lines (in bold font), we may still get up to 30% increase in length of code, in some 
cases accompanied with a decrease in performance. 

We now examine the experimental results for the three programs (Figure 1). For 
both matrix multiplication and 2-D image convolution, the XMT version spawns a 
thread for each entry, while the traditional version clusters the entries, and spawns a 
thread for each cluster. A more irregular computation, the third program finds maxi- 
mum paths in a DAG. The XMT version spawns a thread for each node, while in the 
traditional version each thread handles a cluster of nodes. (Note that we choose to 
show here the “xmt-sync” variant, as it is the closest to traditional version, though not 
the most efficient. Details on this and other DAG implementations appear in section 
5.3.) 




Fig. 1. Effects of no task assignment on performance. Speedups of parallel versions over the 
serial version, running on 64 TCUs. 

For mmult and convolution, both versions achieve similar speedups. The traditional 
rmnult is able to amortize some duplicate work, while the XMT version of convolution 
takes the lead by avoiding some task assignment overhead. For DAG, load balancing 
issues come into play, penalizing static assignment in the traditional version. For 16 
TCUs, costs due to load imbalance constitute 35% of the traditional program running 
time versus 16% for XMT. 



5.2 Fine-Grained Parallelism 

The XMT programming methodology encourages the programmer to express any 
parallelism, regardless of how fine-grained it may be. The low overheads involved in 
emulating the threads allow this fine-grained parallelism to be competitive. However, 
despite the efficient implementation, extremely fine-grained programs can benefit 
from coarsening as it may allow to 1) exploit spatial locality, and 2) reduce duplicate 
work; however, these opportunities occur only in regular codes, such as scientific 
array-based computations, where it is also easy to automatically detect and optimize. 
By grouping consecutive threads, clustering exploits spatial locality, and allows the 
programmer to ignore granularity and task assignment considerations, which are oth- 
erwise relevant. 
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The XMT compiler detects cases where the length of a thread is sufficiently small 
(such that the thread overhead constitutes a significant enough portion of the thread). 
This parameter is evaluated at compile time, using SUIF’s static performance estima- 
tion utility. The compiler then automatically transforms the spawn block such that 
fewer but longer threads are used. In the results we report here, we use grain size as 
big as possible, where the number of clustered threads is fixed to the number of 
TCUS. A few unclustered-threads are reserved at the tail of the spawn. Tuning this 
value can reduce the load imbalance cost. 

To evaluate the effects of granularity on performance, we use the following three 
programs: LU, a linear algebra program that computes lower-upper matrix factoriza- 
tion. Jacobi, a 2-D PDE kernel, and dbscan - a database kernel that emulates an SQL 
query on a non-indexed attributes relation. We compare several versions for each: 1) 
fine-grained: each thread handles one entry, 2) by-row: each thread computes an entire 
row, 3) clustered: the fine-grained version coarsened by the XMT compiler, 4) tradi- 
tional: by-row, with the work assigned to processors (as described in the section 
above). (For dbscan we compare only 3 versions - fine-grained, clustered, and tradi- 
tional). 

All three programs demonstrate the same behavior (Figure 2). For the smallest 
problem sizes, the best speedups are achieved by the fine-grained version, where the 
coarsest version (traditional) gets the lowest speedups. As the problem size increases, 
the coarser versions beat the fine-grained one, while the by-row version is the fastest 
version. Though the advantage that the fine-grained versions demonstrate for the 
smaller sizes is not decisive, it suggests that in the general case, algorithms for more 
irregular applications may benefit. As an example, in section 5.4 we describe how our 
implementations of radixsort and quicksort take advantage of fine-grained parallelism. 

For more regular applications however, the XMT compiler can make fine-grained 
programs competitive with the coarse-grained versions by automatically clustering the 
parallel regions. This allows the programmer to ignore granularity considerations, and 
directly write the easier fine-grained version. 



5.3 Synchronization: Scalable Mechanisms and Asynchronous Algorithms 

Traditional parallel programs typically use locks and barriers for synchronization. As 
we will demonstrate, these synchronization mechanisms can be efficiently supported 
in XMT. However, typically XMT addresses synchronization in a manner that mini- 
mizes busy-waiting. In many cases, the parallel prefix sum operation can be used 
instead of a lock, and the join serves as a barrier. Alternatively, an XMT methodology 
can often suggest an entirely different algorithm. We examine these alternatives with 
two representative programs, dot and DAG. We show results for relatively small 
input sizes, where the relative costs of the various techniques are easily seen. While 
differences tend to be less dramatic with larger workloads, the same trends are evi- 
dent. 
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Fig. 2. Effects of granularity on performance. Speedups of parallel versions over the serial 
version. 

Dot product is an example of a common reduction task. We compare the following 
versions for dot: 

1. XMT-algorithmic-style programs, using a binary tree structure for no-busy- wait 
synchronization [15]. We wrote two programs in this style, one propagates values 
up the tree in a synchronous fashion, involving a spawn and join for each layer of 
the tree (“xmt-sync”). The other version propagates the values in an asynchronous 
fashion, involving only one spawn-join block, within which the threads advance 
as far as they can (”xmt-async”). 
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2. Traditional style programs, in the sense that the problem is decomposed in a 
coarse-grained fashion between the processing units. However, XMT utilities are 
used for synchronization. In the “x_trad-ps” version, the TCUs update the global 
dot product atomically with their portion of the computation using the parallel 
prefix-sum mechanism, instead of locking. In the “x trad-join” version, after all 
the TCUs have completed computing their portion, they join, instead of using a 
barrier, and the global dot product is computed serially after the join by a single 
TCU. 

3. Traditional style programs, using busy-wait synchronization. The “trad_PRlock” 
uses the most efficient but somewhat specialized parallel prefix-sum operation, 
while “trad_memlock” uses a less scalable but more general fetch-and-add 
mechanism, “trad-barr” uses a barrier. 

The general trend is that programs using non-XMT synchronization scale poorly 
with the number of TCUs compared to the others (Figure 3). The exception is the 
“xmt-async” version due to the amount of storage and extra work that it involves. 

This trend in synchronization primitive scalability is further reinforced when exam- 
ining the performance of different versions for the DAG program. In addition, the 
irregular nature of the computation makes it a good candidate for a less synchronous 
programming style. Especially with sparser graphs, asynchronous programs should 
excel by enabling parallelism as soon as it is discovered. We compare the following 
versions [15]: 

1 . XMT style, synchronous fashion: “sync”. A thread is spawned for each node with 
in-degree 0. Each thread steps through the outgoing edges of the node, and dec- 
rements the in-degree of the sink nodes. After a join, threads are spawned for 
newly in-degree 0 nodes. The process is repeated until all nodes are processed. 

2. Traditional style. These versions are based on the “sync” version, adding the 
necessary decomposition. One program uses the prefix-sum as a synchronization 
mechanism (“trad”), and another uses locks instead (“trad-lock”). 

3. XMT style, asynchronous fashion: “async-node” and “async-edge”. The async- 
node version spawns a thread for every node as in “sync”. Whenever a thread 
decrements the in-degree of a sink node to zero, it forks a thread to process that 
node. Async-edge is similar, but a thread processing a node forks a thread to pro- 
cess every outgoing edge. Async-edge is the finest-grain, least-synchronous of 
the versions. 

As expected, the more synchronous the algorithm is, the worse it scales with the 
number of TCUs. This trend can be seen by comparing the relative performance of 
async-edge, async-node, and sync (Figure 3). 

We again see the effect of synchronization mechanism overhead on scalability. 
The relative performance of trad and trad-lock illustrates the superiority of prefix-sum 
compared to traditional busy-wait locking. 
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Fig. 3. Synchronization options in dot (4096 elements), and dag (100 nodes, 473 edges). 
Speedups of parallel versions over the serial version on increasing number of TCUs. 



5.4 Two Case Studies 

In this section we provide two examples of how the combination of features in 
XMT enables new approaches that outperform more traditional algorithms. 

Radix. This integer sort consists of the stable radix sort routine that is applied 
iteratively to each “digif’ of the input until fully sorted. For every value a digit might 
take, we say there is a bin for the key to go in. By counting the number of keys for 
each bin, we can determine a final rank for each bin. 

SPLASH radix uses the common parallel algorithm of [1]. It operates with P proc- 
essors on N input keys. Each processor is assigned a continuous partition of N/P keys, 
and locally computes bin counts from its own partition. Since each processor has its 
own set of bin counters, a global ranking operation must be performed. A binary tree 
of bin counter arrays is formed. Each processor ranks its bins locally and then either 
sums results from other processors further up the tree, or waits for partial sums to be 
propagated back down the tree. Finally, with globally ranked bins, each processor can 
copy each of its keys in sequence to the output array. 

One of the hurdles to scalability is that each processor needs its own set of bin 
counters. The more processors that are applied to a problem, the more time must be 
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spent doing global ranking. This effectively limits the P for the SPLASH radix. In an 
XMT style program, significantly more parallelism can be expressed. 

Due to the need to preserve equal-key order, we are limited to the P sequential 
threads to serially copy keys within individual partitions. However, the bin counting 
step can be much more fine-grained. Provided that threads cope with simultaneous 
access to the same bin counter (e.g. with the prefix-sum primitive), it is possible to 
spawn a separate thread for each key of input. 

The global ranking step can also be finer-grained. SPLASH operates with a granu- 
larity of an entire bin counter array. The XMT approach does not insist that the pro- 
grammer wrestle with data locality. In this light, the global ranking step can be con- 
sidered to be a single, large prefix-sum operation, where the inputs are the interleaved 
bin counters from all the processors. It is then a simple matter to apply a fine-grained 
binary-tree parallel prefix algorithm. 

There is certainly a cost to the more fine-grained algorithm. Without automated 
coarsening of the XMT threads, the coarser-threaded bin counting clearly has the 
work advantage. With 1 TCU, the XMT algorithm performs nearly twice the work of 
SPLASH (Figure 4). However, with 16 TCUs, the SPLASH algorithm is already do- 
ing worse than it did with 4 TCUs. In contrast, the finer-grained program easily scales 
to 16 TCUs. (Note that these results are for an input of 16K keys, while SPLASH 
defaults to 256K keys.) 

Quicksort. Quicksort provides another simple illustration of how XMT might be able 
to express parallelism that is traditionally neglected. A classic example of the divide- 
and-conquer approach, quicksort is a recursive sort using pivots. 

The traditional parallel quicksort follows this divide-and-conquer approach. After 
splitting a partition, a thread forks a new thread to sort the right side, and does the left 
side itself. 

In XMT, the forking need not stop when the full machine parallelism is reached. 
Newly created partitions are queued up, and automatically assigned to threads without 
complicated programmer intervention. This dynamic load balancing is particularly 
useful for handling the unpredictable partitioning of quicksort. 

This approach is still less than ideal. At the begiiming of the program, only one 
thread is active. A machine can not be utilized fully until enough partitions have been 
created. Unfortunately, the partitions encountered at the start of the computation are 
the largest and therefore take the longest time to split. 

A fine-grained XMT program can parallelize the partition step. A thread is 
spawned for each element of the partition. Two counters - “left” and “right” are 
maintained. The left (right) counter keeps track of the number of elements which value 
is less (greater) than the pivot, and are thus copied to the left (right) partition. A pre- 
fix-sum to either the left counter or the right counter determines the output rank of the 
element. 

Since the latter method requires synchronization at each partition, we do not wish 
to use this algorithm throughout the program. The optimal solution is to start with the 
fine-grained, synchronous parallel partitioning algorithm, and then switch to the tradi- 
tional divide-and-conquer when a sufficient number of partitions are available. With a 
64 TCU configuration operating on 16,384 elements, this hybrid approach is more 
than twice as fast as the traditional approach. 




106 



Dorit Naishlos et al. 



Radix 16384: Speedups 



xmt — ■ — splash 




Fig. 4. XMT vs. SPLASH implementation for radix sort. Speedups of parallel ver- 
sions over the serial version on increasing number of TCUs. 

6 Related Work 

Recent work on comparing different parallel programming models [9], [5], [12], [4], 
typically focuses on the shared-memory and message-passing programming models on 
multiprocessor systems. Our work attempts to examine parallel programming with 
respect to the different assumptions implied by an on-chip environment. 

Various other projects explore on-chip parallel architectures: CMP [10], Multisca- 
lar [8], SMT [13]. The current paper is targeted toward exploring shared-memory 
parallel algorithms as applied to scalable on-chip architecture. Note that XMT, with 
the parallel prefix-sum for example, aspires to scale up to much higher levels of paral- 
lelism than other single-chip multithreaded architectures consider currently. Tile based 
architectures, such as MIT’s Raw [16], also expect to scale to high levels of parallel- 
ism. However, Raw utilizes a message-passing model rather than the shared-memory 
model of XMT. In addition, Raw heavily relies on compiler technology to manage 
data distribution and movements between tiles. As such, it is much easier to program 
for the XMT architecture, and it is also expected to address a wider range of applica- 
tions. 

MIT’s Cilk [7] provides a multi-threaded programming interface and execution 
model. There are two important differences in scope. First, since Cilk is targeted at 
compatibility with existing SMP machines, load balancing in software is an important 
element of the project. XMT requires hardware support to bind virtual threads to 
thread control units (TCUs) exactly as the TCUs become available. The low-overhead 
of XMT is designed to be applicable to a much broader range of applications. Second, 
Cilk presents a programming model that tries to match very closely standard serial 
programming constructs, where forking a thread takes the form of a function call. 
While XMT also bases its programming model on standard C, the programmer is 
expected to rethink the way parallelism is expressed. The wide-spawn capabilities and 
prefix-sum primitive are present to support the many algorithms targeted to the PRAM 
model. 
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7 Conclusion 

This paper presented features of XMT, a parallel programming model designed for 
exploiting on-chip parallelism. With prototype compiler and architecture simulator, 
we studied XMT programming in areas where parallel computing has underperformed 
in the past: very fine-grained parallelism; smaller problem sizes; and unpredictable, 
irregular computations. 

XMT features encourage programmers to write high level programs with more fine- 
grained parallelism, without worrying about assigning threads to processors. We 
found the XMT architecture and compiler can usually support this simple fine-grained 
programming style with little loss in performance for on-chip multiprocessors. Thread 
coarsening is necessary in some cases, but can usually be automatically applied by the 
XMT compiler. After thread coarsening, lack of explicit task assignment has only 
slight effect on performance for the XMT architecture. 

When overheads are low enough so that parallelism can be leveraged even for 
small inputs, many more tasks can potentially be sped up. 

The flexible programming style also encourages the development of new algo- 
rithms to take advantage of properties of on-chip parallelism. We demonstrated that 
XMT is especially useful for more advanced applications with dynamic, irregular 
access patterns. 
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Abstract. Most parallel programs use regular topologies to support 
their computation. Since they define the relationship between processes, 
process topologies present an excellent opportunity for debugging. The 
primary benefit is that patterns of expected behaviour can be abstracted 
and identified, and unexpected behaviour reported. 

However, topology support is inadequate in many environments, includ- 
ing the popular Message Passing Interface (MPI). Programmers typically 
implement topology support themselves, increasing the possibility of in- 
troducing errors. Moreover, debugger support that exploits topological 
information is lacking. 

We have undertaken to develop a debugger that exploits topological in- 
formation. This paper presents DEPICT (DEbugger of Parallel but In- 
consistent Communication Traces), a (preliminary) topology-based de- 
bugger for MPI. Currently, DEPICT presents high-level visualisations 
of parallel program communication behaviour, where logically similar 
processes are clearly indicated in a manner that allows the program- 
mer insight into overall program behaviour. To assist in understanding 
unexpected behaviour, DEPICT allows programmers to investigate the 
observed semantic differences between processes. 



1 Introduction 

The parallel programming paradigm is powerful in that it allows scientists and 
engineers to address a variety of computationally expensive problems. The hard- 
ware employed to address such computationally expensive problems ranges be- 
tween dedicated supercomputers and distributed memory multi-processors, to 
the less expensive collection of powerful off-the-shelf personal computers con- 
nected via high-speed networks (workstation clusters). 

A variety environments support the development of parallel programs un- 
der the given hardware environment. One prominent example is the language 
extensions offered by the Message Passing Interface (MPI) which allows 
programmers to enjoy one of the most popular communication paradigms to- 
day: message passing. 

F. Mueller (Ed.): HIPS 2001, LNCS 2026, pp. 109-^3 2001. 

(c) Springer- Verlag Berlin Heidelberg 2001 
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The computing model supported by message passing environments is both 
simple and very general, and accommodates a wide variety of application pro- 
gram structures. The programming interfaces are deliberately straightforward, 
thus permitting simple program structures to be implemented in an intuitive 
manner. The user writes an application as a collection of cooperating tasks. Tasks 
access message passing resources through a library of standard interface routines. 
These routines support the initiation and termination of tasks, as well as commu- 
nication and synchronisation between tasks. Communication constructs include 
those for sending and receiving data structures as well as high-level primitives 
such as message broadcast, synchronisation, and some global arithmetic opera- 
tions. 

Despite its popularity, MPI is not particularly elegant at expressing many 
parallel problems. It is necessary to specify the destination of messages using 
a task descriptor (rank) which uniquely identifies the task to receive the mes- 
sage. This directly contrasts with the method that most parallel programs use 
to describe their communication patterns. Many parallel algorithms employ one 
or more standard process topologies, within which the physical relationship be- 
tween tasks is specified, and the communication patterns are implied. For exam- 
ple, when computing solutions to problems whose data forms a two dimensional 
matrix, or mesh, it is natural for each node to communicate with its immediate 
neighbours in its row or column. When these problems are formulated as algo- 
rithms, it is natural to specify the destination of each message in terms of the 
primary compass directions, rather than using an abstract identifier bearing no 
logical resemblance to the problem being solved. 

A number of standard process topologies are well identified [3- These in- 
clude pipelines, rings, hypercubes, and trees. However, traditional MPI does 
not allow problems to be expressed in terms of these logical communication 
patterns Q. Recent research by Kazemi and McDonald [F] largely addresses 
this issue; extensions to MPI (and the Parallel Virtual Machine (PVM) Q) are 
provided that support the specification of logical communication patterns by 
providing symbolic constants such as SEND_T0_N0RTH, BRDADCAST_TO_ROW, and 
RECV_FRDM_DOWNSTREAM. 

A number of factors complicate parallel program debugging Q. One difficult 
area involves detecting or locating communication errors. Concurrently execut- 
ing processes complicates program understanding, and can obscure the point of 
origin of errors. Namely, errors can originate in processes other than the process 
showing the symptoms of the error. Unfortunately, current debugging techniques 
are too fine-grained, and focus too much on the source code. Often programmers 
must resort to examining traces of inter-process communication pairings, or at- 
tempt to instrument their code with traditional console I/O (ala printfO in 
C programs). Both techniques are flawed. Program traces easily overload pro- 
grammers (and communication networks) with information; often there are large 
volumes of messages, and programmers are limited to identifying processes using 
numeric values. Additional tools are often required to make any degree of sense 
of traces. With console I/O, not only can too much information be produced, but 
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there is a greater degree of program intrusion, which leads to the probe effect, a 
serious issue in non-deterministic parallel programs: changing a program’s tim- 
ing can result in different behaviour, and therefore the manifestation of different 
errors. Minimising the probe effect is important. 

The lack of explicit topology support constrains the way we think about 
debugging parallel programs. Topological knowledge is a valuable debugging 
aid, especially with respect to program understanding. For example, topological 
knowledge can be used to identify misdirected messages. It can also be used to 
better summarise process behaviour. In particular, processes can refer to each 
other via logical descriptors, such as NORTH, instead of numeric descriptors. Such 
knowledge provides a powerful mechanism whereby the behaviour of processes 
can be correlated, and the detection of patterns of events made. Without assis- 
tance, humans find it difficult to identify patterns of events from traces, and even 
from event animations. Topological knowledge is also important if a topologi- 
cally consistent display of the processes is expected. Although generic parallel 
debugging visualisations could be applied, without knowledge of the topology, 
display scalability becomes a serious issue. 

Despite the assistance a topology-based debugger can provide, to our knowl- 
edge there are no such debuggers for MPI. We have undertaken to develop such a 
debugger. This paper presents a preliminary post-mortem trace-based topologi- 
cal debugger for MPI, DEPICT (DEbugger of Parallel but Inconsistent Commu- 
nication Traces). We note that although our work focuses on MPI, the concepts 
are extendible to other environments, such as PVM. 

Debuggers are often used as a last resort. The programmer usually believes 
their code is correct, so they cannot understand why their program is in error. 
Debuggers are employed to help locate where a program does not act as the 
programmer expects it to. A debugger can only try to correct a programmer’s 
incorrect perception of their program’s behaviour. Debuggers cannot be expected 
to explain why an error is present, but they are expected to provide meaningful 
information that the programmer can analyse and act upon. 

DEPICT provides meaningful debugging information by exploiting the fact 
that topology-based programs are expected to exhibit regular, periodic commu- 
nication patterns. Often one or more subsets of correct topology-based programs 
exhibit similar behaviour. By visualising equivalent classes of processes, DEPICT 
provides high-level views of program communication behaviour. DEPICT does 
not just present the trace graphically, it analyses and abstracts it further, mak- 
ing its representation closer to the actual program constructs, and thereby more 
useful. 

The programmer can compare DEPICT ’s output against their mental view of 
the program, where differences identify regions of concern. Since a programmer 
is familiar with their own code, they can correlate the geographical location of 
anomalies against the source code, which suggests the probable source of error. 

In the next section we discuss material related to our research. In Sect.Qwe 
describe in detail the makeup of DEPICT. Examples using DEPICT are given 
in Sect.0 after which, in Sectfl we conclude our paper. 
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2 Related Work 

Processes in MPI are identified by their numeric rank, where there is no require- 
ment that programmers use ranks in a logical or natural manner. Consequently, 
by analysing only trace files, the position of processes within topologies is not 
necessarily clear. In suggesting the topology-based debugging approach, Huband 
and McDonald identify and investigate the “topology identification” issue. 

Huband and McDonald indicate that trace files provide sufficient informa- 
tion for the identification of process topologies, given potentially random task 
descriptors, and the possibility of errors. They suggest using graph distance mea- 
sures as a means of identifying a topology, where, given a set of ideal candidate 
topologies, the idea is to find mappings of processes into each candidate such 
that the number of mismatched edges is minimised in each case. The minimum 
number of mismatched edges is the distance. The candidate with least distance 
to the actual program topology is suggested as the intended topology. 

However, determining optimum mappings of processes is, in degenerate cases, 
an NP-hard problem. In the case of topologies, as more errors are introduced, 
the more difficult the problem becomes. Conversely, in the absence of errors, 
identifying topologies is trivial. Huband and McDonald demonstrate that even 
in the presence of errors, by using topology-specific properties where necessary, 
topology identification is tractable. 

At present we make the reasonable assumption that MPI ranks are used 
naturally, not randomly. This has allowed us to concentrate on other aspects of 
DEPICT, instead of concerning ourselves with worst-case scenarios. 

DEPICT is not the first tool to exploit process topologies. The X-based 
Compound Event Recognition Tool (XCERT) is a topology-based debug- 
ger initially designed for the Fujitsu APIOOO. XCERT, which only supports the 
mesh topology, accepts LERP logs by way of input, analyses them, and simpli- 
fies process traces by introducing compound events and count tags. Compound 
events span only one process and consist of atomic and possibly other compound 
events. Count tags represent the number of times a particular compound event 
is repeated in the trace for a process at the given position. 

XCERT uses a statistical approach to determine compound events, where 
preference is assigned to identifying larger sequences of compound events before 
smaller sequences. Once compound events are identified, they are substituted 
across all process traces. Using reduced task signatures, or task traces using 
compound events where the count tag is ignored, XCERT identifies equivalence 
classes of processes, where two processes belong to the same equivalence class 
if their reduced signature is identical. XCERT displays the equivalence classes, 
allowing the user to see the observed behaviour of their program. 

Although they are for different platforms, some aspects of XCERT are similar 
to DEPICT. However, DEPICT employs a number of more advanced techniques, 
including: 



DEPICT: A Topology-Based Debugger for MPI Programs 113 



— DEPICT employs an algorithmic approach rather than a statistical ap- 
proach, 

— DEPICT allows for a combination of absolute and relative addressing schemes 
instead of requiring the use of one or the other, and 

— DEPICT provides facility for identifying the differences between two process 
execution histories (DEPICT’s functionality is described in Sect.Q. 

Belvedere 0 is a post-mortem debugger where users define abstract events 
(for example, “when all even rows swap values”) that typically span multiple 
processes. Belvedere animates identified events, where individual low-level par- 
ticipating events are animated simultaneously. 

Ariadne □ is another post-mortem tool, where users specify the behaviour 
expected of their program. Ariadne attempts to match the specified behaviour 
against the actual behaviour. The resulting match (or mismatch) is displayed 
using the Ariadne Visualisation Engine. 

Whilst both Belvedere and Ariadne provide powerful facilities, they require 
the user to specify in advance the patterns appropriate to their program. How- 
ever, users tend to be reluctant to use debuggers that require such a learning 
curve. In addition, whilst programmers can verify the behaviour of their pro- 
grams against presented models, not all are capable of clearly specifying be- 
haviour. 

ATEMPT (A Tool for Event ManiPulaTion) QQ, a part of the MAD (Mon- 
itoring And Debugging) environment [u], presents an alternative approach to 
parallel program debugging. ATEMPT uses a partially ordered (in time) space- 
time diagram to depict a program’s communication history. Processes are rep- 
resented by horizontal lines, and the interaction between processes is shown 
by directed lines. ATEMPT provides facilities to detect simple communication 
errors, such as events with different message lengths, isolated events, or race con- 
ditions. ATEMPT also provides some mechanisms to abstract the display on a 
process and event level. Whilst ATEMPT provides some useful functionality for 
the detection of errors, space-time diagrams can become cluttered and confusing 
(although ATEMPT’s abstraction mechanisms alleviate this to some degree). 
Moreover, space-time diagrams do not exploit process topology information in 
that processes are not arranged in a topologically consistent manner. 

ParaGraph 0 is a post-mortem performance debugger that uses the same 
trace file format as DEPICT currently uses, the Portable Instrumented Com- 
munication Library (PICL) extension MPICL (see Sect.Q. ParaGraph provides 
various displays and animations to assist the performance analysis and tuning 
of parallel programs, including some that visually arrange processes according 
to a user selected topology. 

There are numerous other parallel debugging programs and approaches de- 
signed for MPI and other environments, including, among others, the Annai tool 
environment o, and the p2d2 source code debugger 0. 
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3 DEPICT 

DEPICT is a high-level, post-mortem trace-based debugger written in C-|— I- 
with a Tcl/Tk graphical front-end for use with topology-based MPI programs. 
DEPICT’s two key aspects include its visualisation of program behaviour, and 
its process comparison utility. The program visualisation allows programmers 
to gauge the correctness of overall program communication structures, and the 
comparison utility allows programmers to investigate the difference in behaviour 
exhibited by different processes. DEPICT does not analyse the program at the 
source code level, but instead analyses the observable semantics as traced. Ex- 
amples of DEPICT’s appearance can be found in Sect.0 

DEPICT’s components are described below. They include the trace file in- 
terpreter, the topology handler, the group identifier, and the graphical user in- 
terface. A number of components are still being enhanced, in particular the 
topology handler and the graphical user interface. 

3.1 Trace File Interpreter 

MPI programs that are analysed by DEPICT are currently traced using MPICL 
[3, the output of which is read by the trace file interpreter post-mortem. There 
are several factors contributing to our decision to use MPICL, namely: 

— MPICL’s trace format supports our requirements, 

— MPICL’s trace format is human readable ASCII text, simplifying the design 
of the trace file interpreter, and 

— MPICL uses MPI’s profiling interface, making it portable to a number MPI 
implementations . 

Unfortunately, MPICL is not entirely suitable. MPICL was originally de- 
signed for post-mortem performance analysis, not post-mortem error debugging. 
Consequently, MPICL only produces complete trace files when all processes suc- 
cessfully execute MPI_Finalize. Clearly this is unsatisfactory since it excludes 
the analysis of programs that fail catastrophically mid-way through execution. 
As such we are implementing our own robust profiling library as an alternative 
to MPICL. 

3.2 Topology Handler 

The topology handler is responsible for determining the particulars of program 
topologies, given the point-to-point communication trace information (as pro- 
vided by the trace file interpreter) and the topology name as provided by the 
programmer. Collective communication events are not currently used in topology 
analysis. 

Using the available data, the topology handler determines the dimensions of 
the specified topology, identifies the process ordering scheme used (such as row- 
major for a mesh), and provides an abstract interface for the group identifier 
and the graphical user interface. 
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For its first two roles, the topology handler makes the reasonable assumption 
that process rank numbers are used in a logically consistent manner. For exam- 
ple, in a row-major mesh, process rank 0 always corresponds to the first row and 
column. Note that for performance reasons routines such as MPI_Cart_create 
provide options to allow reordering of process rank numbers. Consequently, 
traced MPI_C0MM_W0RLD ranks could look as though they were used haphazardly. 
We note that MPFs profiling interface could trivially be used to disable such 
optimisation features. 

As part of providing an abstract interface for the next two components, the 
topology handler converts the destination and source ranks of SENDs and RECVs 
respectively so that they are represented using relative addresses. For example, 
a process in a mesh could be said to send a message to the process located 
“two rows north, one row east” instead of stating the destination process’ rank 
explicitly. Since it is not clear whether the programmer is using a relative or 
absolute addressing scheme (or a combination), no assumption is made, and 
the absolute addresses are not discarded. Instead, on an event-by-event basis, 
DEPICT automatically deduces whether relative or absolute addresses are used. 

DEPICT currently supports the mesh and torus topologies, where processes 
appear in row-major order. 



3.3 Group Identifier 

The group identifier is responsible for determining sets of processes that exhibit 
logically identical or similar behaviour. Although not described, it is in this 
component where point-to-point communication events are determined to use 
either absolute or relative addresses, or if it remains unclear, possibly both. 

First, the input is split into separate trace streams, one for each process. 
Cycles of repeated patterns of events, or compound events, are then identified; 
compound events do not span multiple processes. Subsequently, process traces 
are reconstructed where patterns of repeated behaviour are represented using 
low-level loop-like constructs. 

Note that although similar, the use of loop-like constructs and compound 
events does not necessarily reflect the actual loop at source code level constructs 
of the program. We are not aware of any trace file format that would guarantee 
the correct identification of actual loop constructs. 

The group identifier classifies processes into groups where two processes are 
assigned the same group if and only if their reconstructed traces are similar. 
Two traces are similar if they appear identical when the number of loop-like 
construct iterations (and sub-loops, and so on) are ignored. Therefore, processes 
executing instructions a variable number of times based upon their position in 
the topology (which occurs in “wavefront” algorithms for example) are assigned 
the same group. 

The group identifier does not make allowances for boundary effects. For ex- 
ample, only the northernmost row of processes in a mesh cannot send messages 
north. Consequently, the northernmost mesh processes would typically exhibit 
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different communication patterns to the remaining processes. A similar obser- 
vation can be made for the westernmost, easternmost, and southernmost mesh 
processes. Whilst the group identifier does not allow for boundary effects, it may 
be desirable to do so in order to see, for example, how a torus approximates a 
mesh. 

The group identifier’s method of classifying processes is only helpful when a 
program executes consistent, regular patterns. This is not always the case, since 
the communication of some programs vary according to the input data (data- 
driven communication patterns), and look irregular from external (trace-based) 
perspectives; messages may be selectively directed to neighbouring processes 
according to the input data in a manner that is not clearly regular. In addition, 
programs that use “tricks” to optimise communication performance may also 
compromise the group identifier. Even so, the regular communication patterns 
employed by most parallel programs are well supported by our approach. 

3.4 Graphical User Interface 

The graphical user interface uses the abstract interface provided by the topology 
handler to display processes in a manner consistent with the identified topology. 
In addition, equivalent processes are displayed in the same colour, and vice versa. 
To account for large topologies, techniques for scaling the display are currently 
under investigation. 

This program behaviour visualisation allows the programmer immediate in- 
sight into the overall characteristics of the program. If the view provided is 
dissimilar to the programmer’s mental model, and assuming the mental model 
is correct, then there must be a communication error present. The inconsistent 
processes indicate where in the source code the programmer could first look for 
errors. 

In addition, the user can request a textual display of the execution history 
for a single group. This is represented using the derived low-level loop-like con- 
structs. 

The graphical user interface also provides a comparison utility, where the 
changes required to convert one group trace to another are identified. The display 
is similar to the single group display, but blue text represents event insertion, and 
red text represents event deletion. The comparison utility assists in identifying 
how particular processes behave differently. Section0 includes an example of this 
display. 



4 Examples of Use 

We have tested DEPICT against a number of algorithms, each with deliberately 
introduced errors. Three of the tests are presented here: a matrix multiplication 
algorithm for the torus, a matrix multiplication algorithm for the mesh, and a 
linear sort algorithm for the linear array (one-dimensional mesh) . The algorithms 
are described elsewhere 
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Due to MPICL’s constraints, errors are such that all programs exit normally. 
Instead of failing catastrophically, each program returns incorrect results. In 
most cases the introduced errors are subtle and easy to overlook. 

As DEPICT’s output is in colour, and this paper is not, screen shots have 
been manually annotated for clarity. 

4.1 Matrix Multiply for the Torus 

This algorithm was executed on an 6 x 6 (logical) torus. Every process is ex- 
pected to exhibit identical behaviour. However, a missing else statement causes 
the easternmost column of processes to ignore incoming messages from the west- 
ernmost column. 

DEPICT’s display (Fig.B highlights the presence of the error: the eastern- 
most column of processes are in their own group, indicating different behaviour. 
Using DEPICT’s comparison query reveals the absent MPI_Recv (Fig.0. 
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Fig. 1. Incorrect matrix multiply, torus. 



4.2 Matrix Multiply for the Mesh 

Like the algorithm for the torus, this algorithm was tested on an 6 x 6 mesh. 
This algorithm was implemented since it exhibits a variety of different process 
behaviours, where: 
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□ 1 DEPICT: CrouD 1 converte<J to flrouD o 
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GATHER. 




COMM FREE. 
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Fig. 2. Comparing group 1 to group 0. 



— the northwest quadrant does not participate, 

— the northeast and southwest quadrants send and receive messages in the 
vertical and horizontal directions respectively, and 

— the southeast quadrant sends and receives messages in both the vertical and 
horizontal directions. 

Process position in the topology also dictates the number of times certain loops 
are executed. Boundary processes behave differently to the norm. FigureQshows 
DEPICT’s output for a correct implementation of this algorithm. 




Fig. 3. Correct matrix multiply, mesh. 



The error introduced to this algorithm involves two MPI_Recv statements 
that receive messages from MPI_ANY_SOURCE, in a situation where messages that 
should be treated differently are expected from two different sources. In our 
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experience, this kind of error is frequently made by novice programmers. The 
incorrect code corresponds to the southeast quadrant of the mesh. 

Upon analysing the trace output, DEPICT presents the display shown in 
Fig0 It is clear that the southeast quadrant is not executing regularly. Display- 
ing the trace for the southeastern processes reveals a relatively confusing jumble 
of events; since the execution history is erratic, DEPICT cannot concisely sum- 
marise the behaviour. 
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Fig. 4. Incorrect matrix multiply, mesh. 



In this example, DEPICT correctly shows that the code governing the south- 
east quadrant of the program is in error. 

4.3 Linear Sort 

Our implementation of the linear sort algorithm has been tested on the 1 x 
18 mesh. In a correct implementation of the linear sort algorithm, all but the 
easternmost and westernmost processes exhibit identical behaviour. The number 
of times the main loop is iterated decreases the further east a process is located. 

Our implementation of the linear sort algorithm is incorrect in that the main 
loop iterates one less than it should, as caused by an incorrect bound. 

As can be seen in Fig.Q the second last process is not behaving the same 
as the other inner processes. Investigation identifies that the final process is not 
executing the primary loop at all, and that the second last process executes the 
primary loop but once, when it should be executed twice. These provide clear 
indications of an error in the bound of the main loop. 
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Fig. 5. Incorrect linear sort. 



This error only manifests itself for some types of input data, although DE- 
PICT always highlights the inconsistency, even when the output is correct for 
the given input. This is because the communication patterns are modified by the 
error, even if the returned result is not. Since DEPICT analyses communication 
patterns instead of program output, DEPICT will always identify these types of 
errors. 

5 Conclusion 

Topological knowledge is clearly useful for a (constrained) class of debugging 
purposes. In each of the tests, DEPICT provided insight regarding the source 
and type of error. Given none of the programs crashed, and the subtlety of the 
errors, a programmer could plausibly overlook such errors. Nonetheless, DEPICT 
clearly indicated the presence of errors, regardless of actual output, since the 
communication semantics were incorrect. DEPICT could easily be employed 
to verify the correctness of communication patterns for programs that seem 
otherwise correct. 

We do not always expect the topological debugging approach to be useful. 
Programs that exhibit no regular communication structures are not analysed, 
and traditional errors, such as invalid memory references, are not addressed. We 
expect traditional errors to be handled by more traditional debuggers: DEPICT 
is a debugger aimed at the high-level observable semantics of programs, not 
low-level concerns, although existing tools could be incorporated. 

DEPICT is still in preliminary development, although much of the ground- 
work has been implemented. Our goal is for a complete debugger that exploits 
topological information. To this end, we have not ruled out the possibility of 
including features available in other tools. For example, we could include the 
facility to check that all SENDs are matched by REG Vs. At present, we are con- 
centrating on developing new features unique to parallel debuggers. 
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Abstract. We present an algorithm for correcting communication er- 
rors using delivered and undelivered messages. It is used to suggest cor- 
rective measures to remove errors introduced by typographical errors in 
message passing systems like PVM and MPI. The paper focuses on the 
validity of the algorithm by proving that for a nontrivial number of errors 
the algorithm can suggest changes to correct the errors. The algorithm 
has been implemented as a tool in Millipede fM.ulti Lerel Interactive 
Parallel Debugger), which is a support environment developed to assist 
programmers to debug message passing programs at different abstraction 
levels. 

1 Introduction 

Detecting and correcting communication errors in message passing programs is 
a difficult problem. According to C. M. Pancake [3] the time spent debugging is 
comparable to the time it takes to write the program initially. Even simple com- 
munication errors are difficult to debug in a parallel environment with multiple 
processes exchanging large numbers of messages. Although there are visualiza- 
tion tools [1,2] to help users visualize the communication patterns of parallel 
programs there are very few tools for detecting and correcting errors based on 
the user’s source code. 

In this paper we introduce an algorithm that finds the fewest number of changes 
necessary to transform a program that deadlocks due to a communication error 
to one that does not deadlock. The algorithm is directed towards the type of 
communication errors that occur because of typographical errors in send and 
receive calls used by message libraries like MPI and PVM and and not those 
due to incorrect protocols. The algorithm not only works for statically speci- 
fied communication. It can also be applied when the sender/receiver is specified 
through an index into an array or by a function call. It is then the programmers 
job to go back and fix the array /function to return what the algorithm suggests. 
We assume that these errors are independent and infrequent. However, the ef- 
fectiveness of the technique decreases as the number of errors increases. It is 
straightforward in the case of one error but clearly of little use when the number 
of errors approaches the number of processes in the system. 

The majority of the paper is devoted to theoretically justifying the validity of 
using this algorithm for correcting errors. We use a counting argument to show 
that for less than n/2 errors, where n is the number of processes, the algorithm is 
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able to identify a few number of potential fixes. This demonstrates the usability 
of the algorithm for debugging these types of communication errors. 

This algorithm is a part of the Millipede debugging system [4] we are developing 
at the University of British Columbia. One tool of interest is a communication 
debugger that can extract the undelivered messages (and in some cases delivered 
messages) in the system and suggest corrective measures after the system has 
deadlocked. 

2 Description of Problem 

PVM and MPI are two widely used message passing systems which are used in 
parallel and cluster computing. The basic send and receive calls in PVM and 
MPI are as follows: 



send (buffer, receiver _nodeID, tag) 
recv (buffer, sender _nodeID, tag) 

Mistyping the nodelD or tag value results in a message that is either undelivered 
or a message received by the wrong process. 

For example, consider the simple case of a single error as shown in hg. 1. 



Process A Process B 



Send(buf, B, tl) -Recv(buf,A,tl) 



Recv(buf,B,t2) Send(buf,C,t2) 

: / : 



Fig. 1. Simple error. 



There is an error in the send call of process B in Fig. 1, B attempts to send 
a message to A but incorrectly sends it to C. Depending on whether the com- 
munication is synchronous or asynchronous process B either blocks, eventually 
hanging the system or terminates, but in either case it results in an undelivered 
message in the system. Millipede records a message history and makes it possible 
to extract both undelivered messages and recently delivered messages from the 
system. The same kind of error can occur if a tag value is incorrect. 
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3 The Algorithm 

Definition 1: Let S = (sq, si, • • • , Sn-i) be an ordered list of senders where 
each Si = (a,h) and a,h integer process identifiers (ranks in MPI). Let TZ = 
(ro, ri, . . . , r„_i) be an ordered list of receivers where each r,- = (a, h) and again 
a, h are process identifiers. For s,- = (a, h) ^ S, a is fixed by the ID of the sending 
process, and for r,- = (a, h) ^ TZ, h is fixed by the receiver. 

Definition 2: A match between a sender Si = (ap hi) and a receiver rj = [aj, hj) 
occurs when (a,- = a j) A (hi = hj). The opposite is called a mismatch. 

For the sake of simplicity we do not consider message tags or wild cards in this 
first analysis. We will however return to these cases later. 

The intuition behind the algorithm is as follows: 

Find a permutation of S denoted and a permutation of TZ denoted where 
the number of fields that need to be changed in order to obtain a system without 
any unmatched sends/receives is minimal. 

If 7Ts and 7Tr are such permutations it follows that after changing the required 
fields that for all Si G and rj G where [i = j) that ai = aj and hi = hj, 
which means that if the user changes his/her program accordingly the deadlock 
will disappear. 

The desired permutations can be obtained by computing a hamming distance 
between all possible combinations of permutations of senders and receivers, and 
choosing the one or ones that give rise to the smallest hamming distance. 
Unfortunately this algorithm has time complexity 0(n!). 

However, it is possible to reduce the problem to a bipartite matching problem [5]. 
The approach is as follows: 

Let G = (V , E) be a graph where 

— V = Vs U Vr where Id, and W both are sets representing all the process ids 
in the system. (Id, represents sending processes and W represents receiving 
processes). 

— id is constructed in the following way. 

• For all messages m not delivered (left in message queues) do the follow- 
ing: 

* If m = (s, r) is an outstanding send add edge (s, r) where s G U and 
r ^ Vr to E with capacity 2. 

* If m = (r, s) is an outstanding receive add edge (r, s) where s G U 
and r ^ Vr to E with capacity 2. 

• Iterate backwards through all successfully delivered messages (u,v) and 
add edges (u, v) and (v, u) with capacity 2 to id if no other edge exist in 
E that has m or as either source or destination. 

• Add edges with capacity 1 to id to make G a complete graph bipartite 
graph. 
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Now run the maximum bipartite graph matching algorithm, which uses flow- 
graphs to obtain a matching in G [?]. 

This matching results in a system without deadlocks. Furthermore this matching 
can be obtained by changing a minimum number of helds in the senders and 
receivers. 

The time complexity is OdG | • | f* |) where | /* | is the size of the matching. Since 
G is a complete graph \E \ = rv^ and | /*|= n. Therefore the time complexity is 
0{n^). 

We do not yet have a polynomial time algorithm for the case where tags are 
considered, but we believe that the bipartite graph matching problem can be 
adapted to cover this case as well. 

4 Algorithm Accuracy 

In this section we will evaluate the quality of the algorithm, i.e. we will validate 
that the algorithm does not frequently return an incorrect answer or more than 
one answer (There could me more than one way to hx a deadlock with a minimum 
number of held changes). To do this we need to introduce a model that precisely 
describes a system of senders and receivers equivalent to the one used in the 
previous section. 

In the following n denotes the number of senders and receivers, and k the number 
of errors in the system. We start out by dehning a few concepts. 

A commumcation configuration C is a pair (S,TZ) {S and TZ dehned in dehnition 
1). SR(n) is a set of all communication conhgurations with n senders and n 
receivers. 

A send Si = (ap hfi is called unmatched if for rb^ = [aj, hj), aj ai. Equivalently 
a receive rj = {aj, hj) is called unmatched if for = (ap hfi, hi hj We call a 
communication conhguration valid if it has no unmatched sends or receives. 
Given a conhguration {S, TZ) = ({so, • • • , Sn-i}, • • • , f*n-i}) in SR(n), Si = 
(ap&j) and rj = {aj,hj). The associated directed bipartite graph G = {V, E) is 
dehned by 




Figure 2 shows for a system with two senders and two receivers (SR(2)) the only 
two valid conhgurations. 

A valid communication conhguration v G SR(n) where Si = r,- Vi, 0 < i < n 
is called the correct conhguration (There is only one correct conhguration in 
SR(n)). 
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Fig. 2. All valid communication configurations in SR(2). 



In Fig. 2 the communication configuration vi is the correct communication con- 
hguration of SR(2). 

From now on the correct conhguration will be denoted Vc- 

Definition 3: B{v,i) is the set of communication conhgurations that can be 
created by moving i or less arcs from v, B{v,i) = B{v,i) \ B{v,i — 1) and 
B{v,0) =B{v,0) = {t;} 

Figure 3 shows B{vi, 1) for vi from Fig. 2. 





Fig. 3. B(vi, 1) C SR(2). 



Given any invalid communication conhguration r>. In Fig. 4 the boldface x marks 
V. The boxes mark valid conhgurations and the rest, marked by x, are other in- 
valid conhgurations. The solid line marks B{v, 0), the dashed line B{v, 1) and the 
dotted line B{v, 2). We wish to correct the invalid communication conhguration 
to be a valid communication conhguration by moving as few arrows as possible. 
This is equivalent to choosing the correct conhguration(s) with the smallest ham- 
ming distance to v and correcting the helds that do not match. This is done by 
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B(v,0) 




Fig. 4. Example of increasing B sets. 



choosing the hrst (one or more) valid communication conhguration(s) included 
in the series B{v, 1), B{v, 2), . . . . In the example in Fig. 4 a valid conhguration 
is found in B(v, 1). 

We will show that for any invalid communication conhguration in SR(n) the 
probability that the hrst encountered valid communication conhguration is the 
correct communication conhguration is high. In other words if we introduce k 
errors into a valid communication conhguration C then the algorithm will in most 
cases end up proposing C as the correct way to hx the erroneous system. Only 
in few cases will it suggest any of the other valid communication conhgurations 
in the corresponding system. 

Given SR(n) and a list of the valid conhgurations V = {r>o, • • • , i’n!-i}- Let k < j 
be the number of errors in the system. We need to consider the conhgurations 
obtainable by introducing one error to all Vi. This is a set of sets: 

Bi={B{vo,l),B{vi,l),...,B{v„,_i,l)} 

If we know for every system with one error that 

n!— 1 

n B{v,,i) = p &=0 

J=0 b e Bi 

the algorithm will always suggest a correct solution. Figure 5 illustrates this 
example. 

The goal is therefore to show the following: 

^ ^ and ^ ^ V e < fc, V Bi,Bj £ Be , {i j). 

\ -Oi \ I JDj I 

In other words small means an acceptable low fraction of wrongly proposed fixes. 
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Fig. 5. Example of overlapping B sets: 1) = 0 but r\,B{v,, 2) / 0 

We will see that this is equivalent to showing that 

^ ^ is small M e < k,\! Bi ^ {B{vi,e), . . . ,B{vn\-i,e)} (1) 

I ^0 \ 

where Bq = B{vc,e) where vq = Vc is the correct communication conhguration 
of SR(n). 

To easier handle these communication conhgurations we are going to introduce 
a short hand notation. For each communication conhguration in SR(n) (the size 
of SR(n) is we assign a 2n digit number (s,-, r,- G {0, . . . ,n— 1}) 

in the following way: 

Si equals the number of the receiver that sender number i is sending to, and r,- 
equals the number of the sender that receiver number i is trying to receive from. 

For example using the 2 conhgurations in Fig. 2 as an example we get the follow- 
ing short hand representation: vi = 0011 and r >2 = 1100. Figure 6 shows which 
conhgurations can be reached in k steps from the correct conhguration. 

We now proceed to proving a number of lemmas that will help prove (1). 
Lemma 1: The number of valid conhgurations in SR(n) is n\. 

Proof: For a conhguration to be valid each sender must send to a distinct 

receiver, and this receiver must receive from this sender. If s,- = j then rj = i. 
It is therefor sufficient to determine the number of different ways to order n 
senders. There are n\ such ways. □ 
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k=4 

k=3 




Fig. 6. Configurations that can be reached in k steps from the conhguration 0011. 
Lemma 2: The size of B{v, i) denoted | B{v, i) \ is 




Proof: 



I B{v,i) 



I U B{vJ) 1=^1 B{vJ) I = [n - l)-^' 

i=o i=o i=o ^ ^ 



□ 



Definition 4: 

The distance between two valid configurations in SR(n), denoted d{vi,Vj), is 
defined as: 



2n 

d(vi,Vj) = Y^[vi, 7 ^ Vj,] 
1 = 0 
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where Vi = Vi^Vi^ 



Va = Va, V 



Jl ^32 ■ 



■ 3,nd 



[‘Vi I 7 ^ ‘Vj,] 



0 : Vi, = Vj, 

1 : Vi, 2^ Vj, 



The valid conhgurations in SR(3) are 001122, 110022, 002211, 220011, 221100 
and 210210. Figure 7 shows the distances between the different valid conhgura- 
tions. 
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Fig. 7. Distances between valid configurations in SR(n). 

Lemma 3: For any SR(n), a distance k. and a valid configuration v the number 
of valid configurations in SR(n) with distance k does not depend on the choice 
of V. 

Proof: Permutations are automorphisms. □ 

Lemma 4: The possible distances between valid configurations in SR(n) are 
4, 6, . . . , 2n — 2, 2n. 

Proof: A necessary condition for a configuration to be valid is that {si = 

{ri,...,r„} = {0,...,n — 1}. Since all valid configurations are equivalent con- 
sider Vc = sir 2 S 2 r 2 , ■ ■ ■ , Snr„. A minimum of two send/receive pairs must be 
switched to obtain a difference valid configuration. This gives a minimum dis- 
tance of 4. Now choose two send/receive pairs a,h to switch. There are three 
cases to consider: 

1. Both pairs are of the form SiVi = ii, which means that either they have not 
been switched before or that they have been switched back to their original 
state. When these pairs are switched the distance increases by 4. 

2. One of the pairs, say a is of the form SiVi = ii, and the other one (b) is not. 
When a and h are switched a will contribute with distance 2 to the total 
distance and h already contributed with distance 2, so the total distance 
increases by 2. 

3. Neither a nor h are of the form SiVi = ii. Neither will contribute further to 
the total distance by being switched. 



□ 
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Let T>{v,k) be the set of valid configurations with exactly distance k from the 
valid conhguration v. 

Lemma 5: The size of V[v, m) for m = 2k is 

I V{v,2k) 1= 

where 




Proof: Let Vc be given. The number of valid conhgurations at distance 0 

from Vc is 1 (only Vc itself is distance 0 away!), therefore cq = 1. According to 
Lemma 3 no valid conhguration is distance 2 away, so ci = 0. For distance 4 
choose two of the n send/receives to move, this can be done in (!)) ways. The 
number of ways these two send/receives can be permuted is 21, but we must sub- 
tract the permutation that does not change anything (and results in Vc), i.e. 1, so 
we get (2) (21-1) = ( 2 )( 21 - (J)ci — (2)00). Set C2 equal to this expression which 
is the number of conhgurations with exactly two permuted send/receives giving 
distance 4. In general we must subtract the number of conhgurations that are 
at the wrong distance. For each distance m = 2k there are ()()fc! permutations, 
and again we must subtract all the permutations that do not exactly permute k 
send/receives, i.e. give the correct distance of 2k, so we get: 



I V{v, m) 




where 




Consider the following two conhgurations: vi = 001122 and V 2 = 002211. These 
two conhgurations differ in the last four positions, thus have a distance of 4. To 
computer the intersection B{vi, 2)r\B{v2, 2 ) we must hud the conhgurations that 
can be reached from both vi and V 2 by changing at most 2 positions in each. 
Since the distance between the two conhgurations is 4 and we may change at 
most 2 positions in each conhguration, it follows that we must change exactly 2 
in each. Choose 2 helds in vi, say vi^ and vi^. Change these two positions to have 
the values of r> 2 , and r> 2 j and obtain We know that d{v[, v 2 ) = 2. Now change 
the 2 positions in V 2 that differ from say V 2 , and r> 2 „ to have the values of 
vii = v'l^ and vi^ = v[^ and obtain v' 2 . We now know that d(v[,V 2 ) = 0. The 
original distance was 4 and we must change 2 helds in each conhguration. The 
the number of different ways this can be done is ( 2 ) =6. 
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The six configurations that can be obtained are: 

im33, 112222, 112^3, 1132#, 113232, 112#2. 

The underlined positions are the helds changed in vi and the overlined helds are 
the ones changed in r> 2 - 

According to Lemma 5 all valid conhgurations are equivalent. Therefore we can 
simply study the properties of the correct valid conhguration Vc of SR(n). We 
will now determine the number of elements in the intersections of the B sets. 

Theorem 1: Let e be the number of errors in a communication system. 

The number of conhgurations with e errors for which the algorithm will either 
suggest a wrong valid conhguration or a set of valid conhgurations where the 
correct one is included is 

e , , e e mm{ 2±tt2i_2i-a} 

[J B(vc,e)r\B(v,e) X] ^0(i,a,h,Co) 

v£V i = 2 6=0 a=max{b,e — b} Cq = 0 

where 

0{i, a, b, Co) = [c,y > OAc, > OAc, > 0] ~ (n-2)=- (n-l)=“ 

and 

^xy — ^o) T Cq) 2/ 

^x — 2/ Cl Cq 
Cy = 2i — b + Co 

Proof: Remember that 

e e 

B{vc,e)= B{vc,a) and B{vj ,e) = [^ B{vj ,b). 

a=0 6=0 

For a conhguration v G B{vc, a) H B(vj, b) with b = d[vj, v) and a = d{vc, v) if 
a < b then Vc will be reported as the correct communication conhguration. Since 
we are concerned with counting the valid communication conhgurations that are 
incorrect hxes we only consider cases where a > b. Additionally, if a + & < e then 
the intersection between B{vc,a) and B(vj,b) is empty. These 3 observations 
combined yield 

[J B{vc,e) n B{vi,e) = (J (J B{vc, e) Ci B{vj , e) 

v£V *G{4,...,2e} VjE.'D{vc,i) 

e e e 

= 'I2 12 [J[jB{vo,a)nB{vj,b) 

i = 2 vj£ 'D{vc,2i) b=0a=max{b,e — b} 



( 2 ) 
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We now calculate the value of 



B{vc, a) n B(vj , b) 



for values a and h where j = 2i. Let x = Vc = * 1*2 •••® 2 n and y = Vj = 
J/iJ /2 • • • J/ 2 n with d[x,y) = j = 2i. We want to hud conhgurations z = v = 
Z 1 Z 2 ■ ■ ■ Z 2 n such that d{x, z) = a and d{y, z) = h. WLOG assume that Xk = y^ 
for j + 1 < k < 2n. 



Now dehne 



j 'Zn-j 

^ 



Xl X2 ■ ■ ■ Xj 
Zl Z2 ■ ■ ■ Zj 

yi V2 ■■■ Vj 



Xj-\-\ • • • X2n 

... Z2n 

Uj + l • • • U'Jn 



C7y = 


= {zk : 


1 < 


k 


<j 


■ Zk = 


Xk 


A 


Zk 7^ 


Vk) 


Cx = 


= {zk : 


1 < 


k 


<j 


■ Zk 7 ^ 


Xk 


A 


Zk = 


Vk] 


Cxy - 


= {zk : 


1 < 


k 


<j 


: Zk 7 ^ 


Xk 


A 


Zk 7^ 


Vk] 


Cg -- 


= {zk : 


i + 


1 


< k 


< 2n 


Zk 


7^ 


Xk A 


Zk 7^ 


- 


= 1 a 


















Cy = 


~-\Cy 


















^xy - 


= 1 Cxy 


1 
















Co - 


= 1 Cg 



















A z-conhguration must satisfy 




since d{x, z) = a and d{y, z) = h. 



This gives us the following 3 equations. 



O, — Cxy T ty -f Cq 

b — Cxy T ^x T (3) 

j — ^xy T T Cy 

The last equation follows from the fact that Xk y^ for 1 < < J . 

If we solve with a' = a — Cg and b' = b — Cg we get the following matrix equation: 
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By inverting the matrix we can compute values for c^y, c^, and Cy\ 

/ 1 1-1\ /a'\ /a' + b'-j\ /a + b-j-2co 

-1 0 1 &' = j - a' = j-a + Co 

V 0-1 1/ Vi/ \ j-b' J \ j-h + Co 

where the number of helds where z differs from x but not from y is c^, the 
number of helds where z differs from y but not x is Cy, and the number of 
helds where z differs from both x and y is c^y- We now look at the different 
ways of choosing helds in z that satisfy these constraints. Of the j helds where 
Xk 7 ^ yk we must choose where z differs from x but not y. Of the remaining 
(i ~ Cx) we must choose Cy helds. The rest of the c^y helds are pre-chosen. Of the 
(2n—j) helds where x and y do not differ we must choose Cq helds where z differs. 

The values of the Cx and Cy helds are chosen to be the values of the opposite 
string, i.e. for the Cx chosen helds the value is that of y and vice versa for the 
Cy chosen ones. The remaining Cxy helds can take any value except those of the 
corresponding helds in x and y, so that leaves [n — 2) choices for these Cxy helds. 
The Co helds can take any value except that of the corresponding helds of x and 
y (which are the same!), i.e. there are [n — 1) different choices for these values. 

By substituting these values in equation (2) and summing over all valid values 
of Co (i.e. values for Co that produce non-negative values for Cxy,Cx and Cy) we 
get the number of conhgurations with distance a to * and distance b to y. The 
Iverson function [cj, > 0 A Cj, > 0 A Cxy > 0] assures valid values of Cxy, Cx and Cy. 
Taking into consideration that Co < and Co < 2i — a (See equation (3)) 

we arrive at 




e e e rnin{ ^2i-a} 

^\V(vc,2i)\^ ^ ^0(i,a,&,Co) = 

i = 2 6=0 a=max{b,e — b'\ Cq=0 

e / \ e e ,2i — a} 

^0(i,a,b,Co) (4) 

*' = 2 6=0 a=max{b,e — b} Cq = 0 

where Ci are the constants from Lemma 4, and 

0{i, a, b, Co) = [cxy > 0][c, > 0][c, > 0] ~ (n-2)=- (n-1) 

and Cxy, Cx, Cy are given as follows: 

^xy — {b Cg'j 2 / 

^x — 2 / Q, Cq 

Cy = 2i — h Co 
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1 Number of errors (e) 




e = 1 


e = 2 


e = 3 


e = 4 


e = 5 




Wrong 


Ambg. 


Wrong 


Ambg. 


Wrong 


Ambg. 


Wrong 


Ambg. 


Wrong 


Ambg. 


2 


0.00% 


0.00% 


0.00% 


54.55% 














3 


0.00% 


0.00% 


16.44% 


8.22% 














4 


0.00% 


0.00% 


10.83% 


2.17% 














5 


0.00% 


0.00% 


7.10% 


0.79% 


21.40% 


1.59% 










6 


0.00% 


0.00% 


4.91% 


0.35% 


14.83% 


0.75% 










7 


0.00% 


0.00% 


3.57% 


0.18% 


10.79% 


0.40% 


19.89% 


0.26% 






8 


0.00% 


0.00% 


2.70% 


0.10% 


8.17% 


0.23% 


13.51% 


0.11% 






9 


0.00% 


0.00% 


2.11% 


0.06% 


6.38% 


0.14% 


9.54% 


0.05% 


19.87% 


0.04% 


10 


0.00% 


0.00% 


1.70% 


0.04% 


5.12% 


0.09% 


6.97% 


0.03% 


13.62% 


0.02% 



Fig. 8. Failure rate for algorithm - Incorrect fixes and ambiguous fixes. 



□ 



Using equation (4) we can not compute an upper bound for the fraction (1). 
Figure 8 shows the estimated failure rate for the algorithm. An ambiguous fix is 
when more than one valid configuration is at the minimum distance. 



5 Message Tags 

We now briefly turn to message passing that includes message tags. We will not 
formally analyze this case, but briefly argue that the chances of the algorithm 
being able to predict the correct solution increases proportionally to the number 
of different message tags used. 

Figure 9 is a copy of Fig. 2 where we have introduced message tags. The and 
Ri both send/receive with tag 11, and S 2 and R 2 both send/receive with tag 22. 
It is obvious that V 2 is no longer a valid configuration as there is a tag mismatch 
between and R 2 and between S 2 and R\. In fact V 2 is now a configuration 
with at least 2 errors. 

By introducing message tags into a communication system, and by choosing 
them carefully, i.e. in a meaningful way with respect to the message they are 
associated with, the risks of the algorithm predicting a wrong solution is greatly 
reduced. 



























































































136 



Jan B. Pedersen and Alan Wagner 



As an example consider SR(2). As seen in the previous example the 54.55% 
ambiguity rate has disappeared as r >2 is no longer a valid conhguration. This 
holds true if 2 errors are introduced in the sender or receiver ids. The correct 
valid conhguration vi is distance 2 away where r >2 is distance 4 away. Similarly 
if 2 errors are introduced into the tags the correct valid conhguration will be 
distance 2 away where as the wrong valid conhguration will be distance 4 away. 




Fig. 9. Introducing message tags. 



Using wild cards is another challenging concept. The use of these is mainly for 
two reasons: dynamic communication or a shortcut for the programmer, i.e. used 
instead of a known process id. When introducing wild cards into a communi- 
cation system the degree of freedom with respect to held values increases. This 
decreases the success rate of an algorithm like the one presented. 



6 Conclusion 

We have presented an algorithm that proposes changes in deadlocked message 
passing systems that will correct the communication errors so the deadlock dis- 
appears. If a small number of errors occur in an otherwise working message 
passing system then we can correct these errors with a good probability. By 
carefully choosing message tags and by associating different tags with different 
types of communication the risks of wrong sends going through is substantially 
reduced. Furthermore and also the ability to predict the correct communication 
conhguration is greatly increased. 
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